f. 
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Remarks 



Claims 1-4, 6-1 1, 13, 16 and 41-44 have been amended. Upon entry of the 
foregoing amendments, claims 1-16, and 41-44 are pending. No new matter is added by 
these amendments. Support for the amendments may be found in the original claims and 
throughout the specification, e.g., at page 17, lines 17-20; page 18, lines 1 1-15; page 21, 
line 1 through page 26, line 26; page 46, line 1 through page 48, line 10; and Example 2. 

Applicant thanks the Examiner for retuming a copy of the initialed Form 1449, 
which was submitted with Applicant's Reply to Office Action filed on February 27, 
2003. 



/. Rejections under 35 U.S. C. § 112, First Paragraph (Enablement) 

Claims 1-16 and 41-44 stand rejected under 35 U.S.C. § 1 12, first paragraph, as 
containing subject matter that was not described in the specification in such a way as to 
enable one skilled in the art to which it pertains, or with which it is most nearly 
connected, to make and/or use the invention. Office Action at page 2. The Office alleges 
that "[w]hile the specification provides some guidance for a method of determining a 
probability value for the above listing using the particular equations or values disclosed, 
the specification does not provide guidance for a method of determining probability by 
any other means." Office Action at page 3. The office further alleges that "[gjiven the 
lack of descriptive working examples in the specification, and the unpredictability of 
generating probability values, the specification as filed is not enabling for any method of 
determining the listed probability values as claimed. The instant application is only 
enabled for the above-mentioned computational means of the four probabilities." Office 
Action at page 3. Applicant respectfully disagrees. 

Applicant thanks the Examiner for acknowledging that the specification is 
enabhng for the following equations: initial oligonucleotide probabiUty (p. 21, equation 
I), transition probability (p. 22, equation II), nucleic acid sequence probability (p. 23, 
equation III), and probability of each state for the nucleic acid sequence (p. 24, equation 
IV). Office Action at page 3. Applicant respectfiilly disagrees, however, with the 
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Office's allegation that the specification does not enable a person skilled in the art to 
practice the invention commensurate in scope with the claims. 

Disclosure of a single species provides sufficient enabling support if one of skill 
in the art can, using the state of the art and Applicant's written disclosures, practice the 
invention in its full scope without undue experimentation .^ See In re Wands, 858 F.2d 
731, 737, 8 U.S.P.Q.2d 1400, 1404 (Fed. Cir. m^^John Hopkins Univ. v. Cellpro. 
Inc., 152 F.3d 1342, 1361, 47 U.S.P.Q.2d 1705, 1719 (Fed. Cir. 1998) (Applicant's 
specification provided sufficient enabling support for the Applicant's claim to 
immunoassay methods using a generic class of antibodies even though Applicant made a 
public deposit of only a single hybridoma cell line that secreted a specific antibody); 
Spectra-Physics, Inc. v. Coherent, Inc., 827 F.2d 1524, 1533, 3 U.S.P.Q.2d 1737, 1743 
(Fed. Cir. 1987), cert, denied, 484 U.S. 954 (1987). Section 2164.03 of the M.P.E.P. 
states that "[a] single embodiment may provide broad enablement in cases involving 
predictable factors,^ such as mechanical or electrical elements." Citing In re Vickers, 141 
F.2d 522, 526-27, 61 U.S.P.Q. 122, 127 (C.C.P.A. 1944); In re Cook, 439 F.2d 730, 734, 
169 U.S.P.Q. 298, 301 (C.C.P.A. 1971). Furthermore, it is well estabhshed law that 
patent applicants are not required to disclose every species enabled by their claims. See 
In re Vaeck, 947 F.2d 488, 496, 20 U.S.P.Q.2d 1438, 1445 (Fed. Cir. 1991). 

Applicant need only show that one skilled in the art would be able to make and 
use the claimed invention using the application as a guide. In re Brandstadter, 484 F.2d 
1395, 1406-07, 179 U.S.P.Q. 286, 294 (C.C.P.A. 1973). hi order to be enabling, the 



' Applicant notes that the performance of routine and well-known steps cannot create undue 
experimentation even if it is laborious. See In re Wands, 858 F.2d at 737, 8 U.S.P.Q.2d at 1404; In re. 
Angstadt, 537 F.2d 498, 504, 190 U.S.P.Q. 214, 218-219 (C.C.P.A. 1976). Time and difficulty of 
experiments are not determinative if they are merely routine. M.P.E.P. § 2164.06, page 2100-186. 

^ AppUcant respectfully disagrees with the Office's impHed assertion that determining probabilities using 
pre-existing statistical methods is unpredictable. The Office states, "[gjiven the lack of descriptive working 
examples in the specification, and the unpredictability of generating probability values, the specification as 
filed is not enabling for any method of determining the listed probability values as claimed." Office Action 
at page 3 (italics added). Applicant respectfully submits that determining probability values using pre- 
existing statistical methods {i.e., knovm mathematical equations) is not unpredictable. Applicant 
respectfully requests that the Office provide legal or other support for the assertion that generating 
probability values is "unpredictable." Office Action at page 3. 



9 



ppln. No. 09/698,213 
James D. MCININCH 

specification need not disclose what is well-known to those skilled in the art and 
preferably omits that which is well known to those skilled and already available to the 
pubHc.^ See, e.g., M.P.E.P. § 2164.05(a), page 2100-185, citing In re Buchner, 929 F.2d 
660, 661, 18 U.S.P.Q. 2d 1331, 1332 (Fed. Cir. 1991); HybritecK Inc. v. Monoclonal 
Antibodies, Inc., 802 F.2d 1367, 1384, 231 U.S.P.Q. 81, 94 (Fed. Cir. 1986), cert, denied, 
480 U.S. 947 (1987); md Lindemann Maschinenfabrik GMBH v. American Hoist and 
Derrick Co., 730 F.2d 1452, 1463, 221 U.S.P.Q. 481, 489 (Fed. Cir. 1984). 

Applicant respectfully submits that the specification as filed is enabling for the 
full scope of the claims. The specification describes, and provides working examples for, 
the use of inhomogeneous Markov models to determine the probabilities for each of the 
one or more states for a selected nucleotide. See, e.g., specification at pages 19, line 7 
though page 27, line 6, and Examples 1 through 3. As such, specification provides 
sufficient support to enable one of skill in the art, using the state of the art and the 
specification disclosure, to practice the invention in its full scope without undue 
experimentation. 

Moreover, although Applicant respectfully maintains that no additional 
information is needed to enable the full scope of the claims, the specification also 
provides that "[a]ny probabihty model applicable to nucleic acid sequence state 
probabilities can be used for the probability steps if the output of the probability model 
sufficiently supports the method, including inhomogeneous Markov models having fewer 
than eight states." See specification at page 19, lines 22-24. The specification also points 
that skilled artisan to Durbin et al, Biological Sequence Analysis (1998), described at 
page 19, lines 26-27 of Applicant's disclosure."^ Applicant respectfully asserts that for at 
least these reasons, the specification as filed provides adequate guidance to enable one of 



Applicant respectfully submits that the Office has failed to provide any evidence to suggest that the 
statistical methods taught by Durbin are not well knovra in the art. Applicant points the Office to the 
Bibliography of Durbin, which cites over 200 references written by a diversity of authors. Durbin, pages 
326-344. 

4 

Copies of the Bibliography of Durbin, as well as Durbin Chapters 5 and 11, are enclosed for the 
Examiner's convenience. See Exhibit A. 
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skill in the art to practice the invention using additional statistical methods that would be 
substitutable for the four equations that the Office has determined to be enabled. 

Moreover, Applicant respectfully disagrees with the Office's implied assertion 
that the material referred to in Durbin is "essential material." Office Action at page 4; 
and Office Action mailed January 13, 2003 at page 5. The M.P.E.P. defines "essential 
material" as including "that which is necessary to provide an enabling disclosure of the 
claimed invention." M.P.E.P. § 608.0 l(p), page 600-79. 

Applicant respectfiiUy submits that the material in Durbin is not "essential" 
because it is not necessary to provide an enabling disclosure of the claimed invention. As 
stated above, disclosure of a single species provides sufficient enabling support if one of 
skill in the art can, using the state of the art and Applicant's written disclosures, practice 
the invention in its full scope without undue experimentation . See In re Wands, 858 F.2d 
at 737; John Hopkins Univ.. 152 F.3d at 1361; Spectra-Physics. Inc, 827 F.2d at 1533; 
M.P.E.P. § 2164.03. Furthermore, it is well established law that patent applicants are not 
required to disclose every species enabled by their claims. See In re Vaeck, 947 F.2d at 
496. 

The Office alleges that "Applicant's reliance on prior art methods may only 
extend to well known methods and that single specific publications do not support their 
content as being well known." Office Action at pages 4-5. Applicant disagrees. 
Applicant reiterates that the Office has not provided evidence to suggest that the methods 
of Durbin are not well known in the art. See footnote 3 infra. Applicant respectfiilly 
submits that the Office has offered no legal support for the assertion that "single specific 
publications do not support their content as being well known." Furthermore, 
Applicant's citation of a single reference, rather than a list of references, cannot property 
be used as evidence that the information contained therein is not well known in the art. 
After all, requiring patent applicants to cite a list of cumulative references would 
contravene the well-known principle that the specification need not disclose what is well- 
known to those skilled in the art and preferably omits that which is well known to those 
skilled and aheady available to the public. M.P.E.P. § 2164.05(a), page 2100-185. 
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For at least the foregoing reasons, Applicant respectfully asserts that the 
specification as filed enables a person of skill in the art to practice the invention 
commensurate in scope with the claims. Applicant respectfiiUy submits that the rejection 
of claims 1-16 and 41-44 under 35 U.S.C. § 1 12, first paragraph is improper and should 
be withdrawn. Reconsideration and withdrawal of these rejections are respectfiilly 
requested. 

Should the Examiner maintain this rejection based on the contention that the 
material disclosed in Durbin is not well known to those of ordinary skill in the art, 
Applicant respectfully requests that the Examiner support this contention by way of 
affadavit in accordance with 37 C.F.R. § 1.104 (d)(2). 

//. Rejections under 35 U.S. C. § 112, Second Paragraph (Indefiniteness) 

Claims 1-16 and 41-44 stand rejected under 35 U.S.C. § 1 12, second paragraph as 
being allegedly indefinite for failing to particularly point out and distinctly claim the 
subject matter which Applicant regards as the invention. Office Action at page 5. 

(a) Rejection of claims 3 and 11 

Claims 3 and 1 1 stand rejected under 35 U.S.C. § 1 12, second paragraph on the 
grounds that they contain mathematical equations that are allegedly confiising as they 
incorporate '"0(f)' representing bias which cancels itself out in each equation, and 
therefore nullifies its effect on the equation." Office Action at page 5. The Office further 
alleges that "[i]f the Applicant intends this bias not to represented [sic] by the same exact 
number in the numerator and denominator, then subscripts, or some other form of 
notation, would be needed in order to clarify this issue." Office Action at page 5. 
Applicant respectfully disagrees. 

Applicant respectfully disagrees that "(I>(f)" (representing bias) cancels itself out 
of the equation. Applicant respectfully points out that "0(f)" corresponds to a function, 
and as such, "0(f)" can have different numerical values corresponding to different 
elements in the set of states. See, e.g., specification at page 47, lines 13-20. As 
acknowledged by the Examiner, when "0(f)" has different numerical values 
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corresponding to different elements in the set of states, "<I)(f)" has different values in the 
numerator and denominator of the equations in claims 3 and 11, and hence "0(f)" does 
not cancel out. Compare, e.g., calculation at page 46, lines 1-5 with calculation on page 
48, lines 5-10. Applicant therefore disagrees that bias cancels itself out of the equation. 

Applicant also respectfully disagrees with the Office's assertion that subscripts or 
other notation are required to clarify this issue. Applicant respectfully submits that 
acceptability of the claim language depends on whether one of ordinary skill in the art 
would understand what is claimed, in light of the specification. M.P.E.P. § 2173.05(b). 
Applicant points out that the specification clearly defines "0(f)" as a function. See, e.g., 
specification at page 24, lines 4-5 and lines 18-25. Applicant respectfully submits that 
one of skill in the art would understand that a function may be assigned different values 
under different circumstances, and would also understand that Example 2 illustrates that 
the values substituted for "0(f)" do not cancel out of the equation. Compare, e.g,, 
calculation on page 46, lines 1-5, with calculation on page 48, lines 5-10. Applicant 
therefore respectfully submits that one of skill in the art would understand the meaning of 
"0(f)" in light of the specification, and that no subscripts or notations are necessary to 
clarify the issue. 

For the foregoing reasons. Applicant respectfully asserts that the specification 
contains guidelines sufficient to teach the meaning of the claim language "0(f)" to one of 
ordinary skill in the art, and thus, the rejection of claims 3 and 1 1 under 35 U.S.C. § 1 12, 
second paragraph is improper and should be withdrawn. Reconsideration and withdrawal 
of this rejection is respectfully requested. 

(b) Rejection of Claims 1, 7, 8, and 41-44 

Claim 1 stands rejected under 35 U.S.C. § 1 12, second paragraph, on the grounds 
that it recites the phrase "said probability of said nucleic acid sequence" which is 
allegedly "vague and indefinite due to the lack of clear antecedent basis for the noted 
phrase in part d) of claim 1." Office Action at page 5. The Office further alleges that 
"[t]his lack of antecedent basis and unclear wording is also present in other independent 
claims 7, 8 (regarding part d) said window probability), 41, 42 (part a) probability of a 
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window), 43, and 44 (part d) said window probability). This rejection is also applicable 
to claims 2-6 and 9-16 which are claims dependent from said independent claims due to 
their direct or indirect dependence." Office Action at page 6. Applicant respectfully 
disagrees. 

Applicant disagrees that there is a lack of clear antecedent basis for the phrase 
"said probability of said nucleic acid sequence." However, in order to facilitate 
prosecution. Applicant has amended claim 1. 

Applicant further disagrees that there is a lack of antecedent basis and unclear 
wording in claims 7, 8, and 41-44. Applicant respectfully submits that the specification 
defines "window" as "a contiguous and defined number of nucleotides within a nucleic 
acid sequence." See, e.g., Specification at page 17, lines 17-20. Applicant also directs 
the Office to page 25, line 25-26 of the specification, which states "[i]n order to 
determine the state probabilities for more than one nucleotide, a window is used for each 
nucleotide that is examined." Applicant therefore submits that one of ordinary skill, 
reading the claims in light of tiie specification and in light of his or her knowledge of the 
art, would understand the meaning of the phrase "said window probability." When read 
in light of the specification, the phrase "said window probability" is no less 
understandable than the phrase "said initial oligonucleotide probabiHty." However, in 
order to facilitate prosecution. Applicant has amended claims 7, 8, 41, and 44. 

Applicant therefore submits that the grounds for the rejection of Claim 1, 7, 8, and 
41-44 has been rendered moot. Applicant fiirther submits that the amendments to claims 
1, 8, and 41-44 has also rendered moot the rejections of dependent claims 2-6 and 9-16. 
In light of these remarks. Applicant respectfiilly requests withdrawal of these rejections. 

(c) Rejection of Claims 1, 7, 8, and 41-44 

Claims 1, 7, 8, and 41-44 stand rejected under 35 U.S.C. § 112, second paragraph, 
on the grounds that they recites the phrase "based upon" which allegedly renders unclear 
"the metes and bounds of the parameters that that determine how much basis is included 
upon the determinations." Office Action at page 6. The Office fiirther alleges that 
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"[cjlaims 2-6 and 9-16 are also indefinite due to their dependency from claims 1 and 8." 
Office Action at page 6. Applicant respectfully disagrees. 

Applicant disagrees that the phrase "based upon" renders unclear the metes and 
bounds of the claim. However, in order to facihtate prosecution, Applicant has amended 
claims 1,7,8, and 41-44. 

AppUcant therefore submits that the grounds for the rejection of Claim 1, 7, 8, and 
41-44 has been rendered moot. AppHcant further submits that the amendments to claims 
1, 8, and 41-44 has also rendered moot the rejections of dependent claims 2-6 and 9-16. 
In light of these remarks. Applicant respectfully requests withdrawal of these rejections. 

(d) Rejection of Claim 7 

Claim 7 stands rejected under 35 U.S.C. § 112, second paragraph, on the grounds 
that it recites the term "capable", which allegedly is a relative term that renders the claim 
indefinite. AppUcant respectfully disagrees. 

Even if "capable" were a relative term, the use of a relative term does not make a 
claim per se indefinite. Seattle Box Co. v. Industrial Crating & Packing, Inc., 73 1 F.2d 
818, 826, 221 U.S.P.Q, 568, 574 (Fed. Cir. 1984); M.P.E.P. § 2173.05(b). Breadth in a 
claim is not to be equated with indefmiteness. In re Miller, 441 F.2d 689, 169 U.S.P.Q. 
597 (C.C.P..A 1971); M.P.E.P. § 2173.04. The words of a claim must be given their 
plain meaning unless they are defined in the specification. M.P.E.P. § 21 1 1.01, page 
2100-47. 

For at least these reasons, Applicant submits that Claim 7 is not indefinite in the 
recitation of "capable." However, in order to facilitate prosecution, Applicant has 
amended Claim 7. Applicant therefore submits that the grounds for the rejection of 
Claim 7 has been rendered moot. Li light of these remarks. Applicant respectfully 
requests withdrawal of this rejection. 

(e) Rejection of Claims 3 and 11 

Claims 3 and 1 1 stand rejected under 35 U.S.C. § 1 12, second paragraph, on the 
grounds that they are allegedly "vague and indefinite due to the lack of clarity in the 
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following terms: f, S, Pf, Pj, and O." Office Action at page 7. The Office further alleges 
that "a clarification of the metes and bounds is required, by listing in the claim the exact 
definition of each term in order to make clear whether definitions from the art should be 
utilized or those in the specification since, as argued by Applicant, art defined (not 
specification defined) methods are apparently heavily relied upon by Applicant." Office 
Action at page 7. Applicant respectfully disagrees. 

The test for determining whether terms in a given claim are indefinite is whether 
one skilled in the art would understand what is claimed. Amgen, Inc. v. Chugai 
Pharmaceutical Co., Ltd, 927 F.2d 1200, 18 U.S.P.Q.2d 1016 (Fed. Cir. 1991), cert 
denied, 112 S.Ct. 169 (1991). M.P.E.P. § 2173.02 states that "[d]efmiteness of claim 
language must be analyzed, not in a vacuum, but in light of: (A) The content of the 
particular apphcation disclosure; (B) The teachings of the prior art; and (C) The claim 
interpretation that would be given by one possessing the ordinary level of skill in the 
pertinent art at the time the invention was made." 

Under M.P.E.P. § 2173.02, the meaning of the terms f, S, Pf, Pi, and <D must be 
determined in light of factors (A) through (C) listed above. Applicant respectfully 
submits that one of skill in the art would understand that f, S, Pf, Pi, and O correspond to 
terms, or parts of terms, of a mathematical equation. Applicant further directs the Office 
to pages 21-25 of the specification, and Examples 1-2. Applicant respectfully points out 
that they know of no legal requirement to list "the exact definition of each term" within 
the claim, and respectfully requests that the Examiner state the legal basis which is relied 
upon for this statement. 

For at least these reasons. Applicant submits that one of ordinary skill in the art, 
when reading the claim terms f, S, Pf, Pi, and ^ in light of the specification and the 
teachings of the prior art, would understand what was meant by Claims 3 and 11. 
Therefore, Applicant respectfully requests that the indefiniteness rejections of claim 3 
and 1 1 under 35 U.S.C. § 1 12, second paragraph, be withdrawn. 



16 



'ppln. No. 09/698,213 
James D. MCININCH 

09 Rejection of Claims 8 and 44 

Claims 8 and 44 stand rejected under 35 U.S.C. § 1 12, second paragraph, on the 
grounds that they allegedly lack clarity due to the claim language "determining a 
probabihty for said window for each of said states. Claims 9-16 are also indefinite due to 
their dependency from claim 8." Office Action at page 7. AppUcant respectfully 
disagrees. 

The Office alleges that "a probability cannot be determined for a window, but 
rather the states found in the window." Office Action at page 7. Applicant respectfully 
disagrees. As stated above, under M.P.E.P. § 2173.02, the meaning of the phrase 
"determining a probability of said window for each of said states" must be determined in 
light of (A) The content of the particular application disclosure; (B) The teachings of the 
prior art; and (C) The claim interpretation that would be given by one possessing the 
ordinary level of skill in the pertinent art at the time the invention was made. Applicant 
respectfully submits that the specification defines "window" as "a contiguous and defined 
number of nucleotides within a nucleic acid sequence." See, e.g.. Specification at page 
17, line 17. Applicant therefore submits that one of ordinary skill, reading this phrase in 
light of the specification and his or her knowledge of the art, would understand the 
meaning of the phrase "determining a probability of said window for each of said states." 
When read in light of the specification, the phrase "determining a probability of said 
window for each of said states" is no less understandable than the phrase "determining an 
initial oligonucleotide probability." However, in order to facilitate prosecution, claims 8 
and 44 have been amended. 

Applicant respectfully submits that, in light of the above arguments, the grounds 
for the rejection of Claims 8 and 44 has been overcome or rendered moot. Applicant 
further submits that the rejections of dependent claims 9-16 has also been overcome or 
rendered moot. In light of these remarks, Applicant respectfully requests withdrawal of 
these rejections. 
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///. Rejections under 35 U.S. C. § 102(b) 

Claims 1, 4, 5, 7-9, 12, 13, 15, and 41-44 stand rejected under 35 U.S.C. § 102(b) 
as being allegedly anticipated by Borodovsky et al. (Computers Chem., 1993). The 
Office alleges that "Due to the confusion (see 35 U.S.C. 1 12, paragraph rejection 
above) of "0(f)" effectively canceling itself out in the equations of claims 3 and 1 1, these 
equations are equivalent to the equations listed on page 129 (Borodovsky et al.). Being 
equivalent equations, if one probability (as provided by Applicant) is "capable of 
accepting a bias" (claim 7, line 10), then the same probability stated by Borodovsky et al. 
(page 129) must also be capable of accepting a bias. Therefore, Borodovsky et al. 
anticipate the instant invention." Applicant respectfully disagrees. 

As noted above. Applicant respectfully disagrees that "0(f)" (representing bias) 
cancels itself out of the equation. Applicant respectfully points out that "0(f)" 
corresponds to a function, and as such, "0(f)" can have different numerical values 
corresponding to different elements in the set of states. See, e.g., specification at page 47, 
lines 13-20. As acknowledged by the Examiner, when "0(f)" has different numerical 
values corresponding to different elements in the set of states, "0(f)" has different values 
in the numerator and denominator of the equations in claims 3 and 1 1, and hence "0(f)" 
does not cancel out. For example. Applicant points the Office to Example 2, pages 46- 
48 of the specification, which illustrates that the values substituted for "0(f)" do not 
cancel out of the equation. Compare, e.g., calculation at page 46, lines 1-5 with 
calculation on page 48, lines 5-10. AppUcant therefore disagrees that bias cancels itself 
out of the equation. 

Applicant respectfully disagrees that claims 1, 4, 5, 7-9, 12, 13, 15, and 41-44 are 
anticipated under § 102(b) by Borodovsky. As noted by the Examiner, anticipation under 
§ 102(b) requires that every element of a claim appears in a single reference. Applicant 
respectfully asserts that claims 1, 4, 5, 7-9, 12, 13, 15, and 41-44 each contain the 
function "0(f)" (or the phrase "bias fimction"), which is lacking in Borodovsky. 
Applicant respectfully submits that claims 1, 4, 5, 7-9, 12, 13, 15, and 41-44, as amended 
herein, are therefore not anticipated by Borodovsky. AppHcant therefore respectfully 
requests withdrawal of the rejections under 35 U.S.C. §102(b). 
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In addition, the Office alleges that "if one claim is 'capable of accepting a bias' 
(claim 7, line 10), then the same probability stated by Borodovsky et al (page 129) must 
also be capable of accepting a bias." Office Action at page 8. Applicant respectfully 
submits that claim 7 has been amended, and that, in light of the amendment of claim 7, 
the grounds for the rejection of claim 7 have been overcome or rendered moot. In light of 
these remarks. Applicant respectfully requests withdrawal of the rejection of claim 7. 
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Conclusion 



In view of the above, each of the presently pending claims is believed to be in 
immediate condition for allowance. Accordingly, the Examiner is respectfully requested 
to withdraw the outstanding rejections of the claims and to pass this application to issue. 
The Examiner is encouraged to contact the undersigned at (202) 942-5512 should any 
additional information be necessary for allowance. 



Date: August 12. 2003 

ARNOLD & PORTER 
555 Twelfth Street, N.W. 
Washington, D.C. 20004-1206 
(202) 942-5000 telephone 
(202) 942-5999 facsimile 



Respectfully submitted, 




Rachel L. Adams (Reg. Attorney No. 54,660) 
David R. Marsh (Reg. Attorney No. 41,408) 
Holly Logue Prutz (Reg. Attorney No. 47,755) 
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"^^^^ AAAAAAAT^AAAAAAAA BBBBBBBBBBBBBBBBCCCCrrrrrrr 

K^-™ VLSPADKTmnCAAWGKVGA.-HAGEYGAEAL^FLlF^TKTYFPHF 

^G-S^ ™eeksavtalwgkv----nvdevggealgrllvwpwtqrf^^^^ 

S chSp vlsegewqlvlhvwakvea-dvaghgqdilirlfkshpetlekfdrf 

GLB3_CHITP LSADQISTVQASFDKVKG DPVGILYAVFKADPSIMAKFTOP 

rLBi-GLYm g^^tesqaalvkssweefna-nipkhthrffilvleiapaakdlfs-f 

rnnbntn= glsaaqrqviaatwkdiagadngagvgkdclikflsahpqmaavfg-f 

consensus ls v a W kv . . g . L. . f . p . f F 

Helix DDDDDDDEEEEEEEEEEEEEEEEEEEEE FFFFFPPFPPPP 

HBA_HUMAN -DLS HGSAQVKGHGKKVADALTNAVAHV— D-DMPNALSALSDLHAHKL 

HBB_HUMAN GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL---D--SSlSElSl" 
MYG PHYCA KHLKTEAEMKASEDLKKHGVWLTALGAILKK----K-GmELKpS^?S" 
GLB3_CHITP AG-KDLESIKGTAPFETHANRIVGFFSKIIGEL--P---NIEADmSp^ 
GLB5_PETMA KGLTTADQLKKSADVRWHAERIINAVNDAVASM--DDTEKMSMKLRD^gSsF 

Consensus ^^"":-*^~°^^^^<^^vlaqigvavshl-gdegkmvaqmkavgvrhkgygn 

consensus . t - . . v. .Hg kv. a a...l d . a 1. 1 h . 

Helix FFGGGGGGGGGGGGGGGGGGG HHHHHHHHHHHHHHHHHHHHHHHHhh 

HBA.HUMAN "IWDPyNFKLLSHCLLVTLAAHLPAEFTPAV™^^ 



:;^™^^G^^YHX?^«^GKEFTPPVQAA^QK^^^ 



b¥s 

GLB5_PETMA -QVDPQYFKVLAAVIADTVAAG dagfeklmsmicillrsay 

J^^?-™^ — "^^ahfpvvkeailktikewgakwseelnsawtiaydelaivikkemndaa— 

Coniens!s '^^'^^^f ^^^^^^^samehriggkmnaaakdawaaayadisgalisglS^— 
onsensus v. f 1 f . aa. k. . 1 sky 



Figure 5.1 An alignment of seven globins fmm Bashford. Chothia & 
Lesk [1987]. To the left is the protein identifier in the SWiss-PROT 
database [Bairoch & Apweiler 1997]. The eight alpha helices are shown as 
A-H above the alignment. A consensus line below the alignment indicates 
residues that are identical among at least six ofthe seven sequences in upper 
case, ones identical in four or five sequences in lower case, and positions 
where there is a residue identical in three sequences with a dot. 



them, and certain residues are particularly strongly conserved. When identifying 
a new sequence as a globin, it would be desirable to concentrate on checking that 
these more conserved features are present. How to obtain and use such informa- 
tion will be the subject of this chapter. 

. . As might be expected, our approach to consensus modeUing will be to make 
a probabiUstic model. In particular, we will develop a particular type of hid- 
den Markov model weU suited to modelling multiple alignments. We call these 
profile HMMs after stzndaid profiles, which are closely related non-probabilistic 
' stmctures introduced previously for die same purpose by Gribskov, McLachlan 
& Eisenberg [1987]. Profile HMMs are probably the most popular application of 
■hidden Markpv models in moleciJar biology at die inoment - ? 

' " We will assume for the purposes of tiiis chapter that we are^ given a correct 
multiple aUgnment, from whichiwe.will build a model that can-be-iised to find 
and score, potential matches to new sequences. The multiple aUgnment could 
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be built from structural information, like the globin alignment shown here, or it 
In ChapteTe ' ^^'^"^"^^"''^^^ ^"g"'"-"^ P^^^^edure, such as those discussed 

Much of this chapter makes use of the theory presented in Chapter 3 for general 
HMMs The most important algorithms will be presented again in the specific 
fon. relevant to profile HMMs. There is also an extensive discussion of h^^; to 
estimate optimal probabiHty parameters from multiple sequence alignments. 

5.1 Ungapped score matrices 

One general feature of protein family multiple alignments, which can be seen 
in Figure 5.1, is that gaps tend to Une up with each other, leaving solid blocks 
where there are no insertions or deletions in any of the sequences. We will start 
by considering models for these ungapped regions 

model for such a region would be to specify independent probabilities e,(a) of 
ob^rvmg ammo acid a in position / (we use letter . because these will nlrn out 

°' ^hen we introduce 
gaps). The probability of a new sequence x according to this model is then 



P(x\M) = Y[e,(x,), 



1 = 1 



iw'" t . 1 ' " in fact more 

interested m the ratio of this probability to the probability of . under a random 
model, and so to test for membership in the family we evaluate die log-odds ratio 



5 = Elog 



eiixi) 



1=1 



Tie values log^ behave like element i„ a seoie matrix s(a.b). where the 
^Mta ,s posidon „ther than .mi„„ acid For dds reasta, s"hl 
a^h K taown .^ nposlrion ^cifc score malri, (PSSM). A PSSM can be 

^ S ^eLn'tT"' " ' ' "^"^ " "'--•^S dK, 



^r ^^^^^^^ and delete States to obtain profile HMMs 

-Although aPSSM captures some conservationinfomiation, it is clearly an inad- 
equate representation of all the infonnatlon in a multiple aligmnent of a protein 
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5.2 '^'''gmser,a,uldeU:,eMU:„„obu,i„p„jiUHMM, 103 

The approach we take is to build a hidden Markov model (HMM^ wifh . . 
itive structure of states but diffprenf nr^K t, i v («MM), with a repet- 

vide a Ml P»*.b,,M; m^eTrrrcit Z '""^^ ™' 

Off by „b»„i„, a., p^s^ i:^: 

idenUcal sates lhat we will call ,7 ' "f 

bility I. » «"U call ^ch stales, separated by transitions of proba- 

EEHZh - • 0*R 

e'::^:r::^estir::r:<r-- 
d,e ^L^-w^tXTs ;rn~f 'r "iir " -^-^ 

insetiom, after the residue matchin. ,h?^ '. . "here I, wdl be nsed to maKh 
The I, have emission dSwb™™ f M to ,h ""^ '"^■«- 

ground dislribntion „ ' ^ "°™^'' "> ^ 

^e.,.aceonnn«,^„„lti.resla„einse,.l"a,^',\^ir^ 

M,+, . Here ,s a single insert state of this Bnd: ' 
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Detete i.e. segments of U,e multiple alignment that are no. matched by 

neighbounng match states: 




lot If r v'^"'^ '^"^ ^'P^ " ^ require a 

lot of transitions. Instead we induce silent states D, as descrilid in Section 3.4! 




Because the silent states do not emit any residues, it is possible to use a sequence 
of them to get fiom any match state to any later one. between two residues in 

M D transition followed by a number of D -> D transitions, then a D -> M 

TTZ :T ""^^^ "'^"^""^ ^« -ert. although 

ti.e path through the model looks different. In detail, it is possible that t^e D ^ D 

fransit.onswmhavediffei^ntprobabihties,andhencecontributediffei.ntlyto 
core, whereas aU the I I transitions for one insert involve the same stie 1 
so are guaranteed to have the same cost estate, ana 

The fuU resulting HMM has the structure shown in Figure 5 2 This form 

r^93?aidT' r n ''"''^ ^^^^^ ^ ^ ^ 

[1993] Krogh et al. [1994]. We have added transitions between insert and 

delete states, as they did, although these are usually very improbable LeavSg 

'rdi:rrr:r^-^-^---^-^^^ 



■'Mr 




'^'"^^'»''^^^of<ipwfileHMM. We me diamonds to 
'"d'cate the insert states and and cirvlesfor the delete states. . .. 
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We have seen how the costs of u " 

used in pairwise alignment with affineZsThH ' "'f ' "^"^ ™ 
it is useful to consider the degenerate case wL t "'"^ relationship. 
Which we build the HMM contains ju^t oZlaZ ""^'^^^ ^"^"'"^"^ 

Let us compare Figure 5.2 with 4ure 4 2 Tr^^^ . 
then Figure 5.2 is an unrolled versi H Rg j T2 ,H T^^^^^ ^^"^"^ ^■ 
commg from a separate copy of the Sffl t^' ''^^^'^"^ ^^ch 

sequence of match statesMLi,tcores^- ' ^^^^P-" ^ a 

to incarnations of Y. To achieve a^ cloZ ^T 'T'^"^°''' 
values for the match emission prbabrjrr^^^^^ 
probabilities of seeing a given in a p^i ^^^^tional 
probabilities = ^ /^"^ ^^^^ ^gnmem, and for the transition 

; In formal terms our profile HMM is ^^7.?' ^' T ' ' • 
-tained by conditioning! paS Jf ^ "^"^"^ ob- 

one of the sequences in its aIign™B!f °" ^'^"^"g sequence as 

finding the most p„,bable digfT'^f " *^ '^r 

same as those for the most pfZbTe alt^e " T '" ^"^^^^^y *e 

ta. 7.e key We. S tSL^™*- '™ - 
slmcOM as shown in Rg^ 5 2 ta sJI " same 

,*e Whole toily. EssenMy, we ZZZT h ^« °f 

, -'*^'«anumberofdifferemwav,r7 '^f^''^""'"'"'- 

: W shown i, Figure 5.^ - r""" globto align- 



Non-prokbilistic profiles 



■H-..n..I).Howe.e..ey^.rLrar™r^-^^^^ 
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HBA^HUMAN 

HBB_HUMAN 

MYG_PHYCA 

GLB3_CHITP 

GLB5_PETMA 

LGB2_LUPLU 

GLB1_GLYDI 



.VGA--PIAGEY. 

. V NVDEV . 

.VEA-^-DVAGH. 

.VKG D. 

.VYS--TYETS. 
.FNA--NIPKH. 
. lAGADNGAGV . 



Figure 5.3 Ten columns from the multiple alignment of seven globin protein 
sequences sho.n tn Figure 5.1. The starred columns are or^s ZZi be 
treated as matches ' in the profile HMM. 

but rather directly assigned position specific scores for each match state and .an 
penalty for use in standard 'best match' dynamic programm^g C"et Z 

^Z1:2Z:T " ^^^^^^-'^ standL^uTsJStit 

scores from all the residues seen m the corresponding multiple alignment column 
For example, they would set the score for residue /in colL i~^7e 

fj(V,a) + ij(F,a)+ij(i,a) 

where sifl,h) is the standard substitution matrix. They also set gap penalties for 
each olumn using a heuristic equation that decreased the cost of a gap (e^e 
« or deletion) according to the length of the longest gap obse^^ Ste 
multiple alignment spanning the column. 

^^^t^^TT rr^'^^'^ ^'^^^"^ *° '^^-bine information, and it 
itlr r ^^^^ "members of families 

lotZa T wfh r""? " ^""^^^^^ - that 

m column 2 If we had an aligmnent with 100 sequences. aU with a cysteine (c\ 

s^uTie ^^''7""^" ^'^^'^^y *e same as would be derived from a single 
S^^^^ correspond to our expectation that the likelihood'f a 

7Jteine should go up as we see more confirming examples 

In ad^bon to these observations about substitution scores, the scoi^ for aaos 
do not behave as expected. For example, from the aUgnment in fS^ ?ff 
score for a deletion would be set to be the same in coLmn 2 ^Sere if I 

ope^g m five of the^ seven sequences. It.would be more reasonable to set the 
pro>bUity.of^^ew:gap^pening..to 
Changes: have been made, to non-prohabiUstic profiles, to addresr^^e and 
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Basic profile HMM parameterisation 

Let us turn back to hidden Markov model profiles Like all HMMc u 
emission and transition Drohah.-iit;»c A • . ^"^^ "^Ms, these have 
^. a profile HMl^ can 1 ^tT^"' *" "'"'"'""''^ ^ 

sequences. Hie aim of Zt^^^ dislnbuaon over the whole space of 

Tbe parameters we have available lo control the shaie ^- , ■,. ■ 
.he values of the probabiUties, and also the length oftTmlT^ '""? 
say about setting these optimally. We sive hen, .h, i^,!^ model There is a lot to 
C19^,. A«ersec.„nson<ulsefe::LTgtlSr^<,S"^f"*« 
.we^-^l-e^tuto.extendeddiscnsslonofalteLtivc":^^^^ 

.0 inset, ^i:^Tz:r^^Tr"- T " 

.had a match state for each tesidue v H™,„ , t °* '^'^ > 

clear that the consensus seou^™ for'th ' " ^ '^ 

and that the two non-sS^dttolr '^:,T^r"'''"^' 
insertion with resoect fo fh. ""^^^-^^^^I should be treated as an 

insert. A simple rule^To iXr" ^ T" 

charactets should be m^IewtyT^;; ""^ *™ ""^^ap 

au«:;rrga:::fitrr^"'*"'''^'''-'^-^*^ 

:di«%usinge,„a;r;.?3rC~We'T'^^^*^ 
.^eachtransitionoremissLisurrfsi^^^^^^^^ 



and = 



these and 



r.~itdtrrrr'°""^;*"='^'-^«<>-<'«-»i*" 

TMn the lim f «f K corresponding ftequencies. 

Nitsta^^vtr^^sr!^*^"'""^^-^ 

r*.er,i,hasp,oMe,rX2?lrv.?^ 

r*«-some ■rat.iUon..^ S;^ne1r^'J---'^'» 
.»d so would acquire ^r„ probahihty/wUch^rreT^^.tTb: 
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to avoid zero probabiUties we r^,n .hh / mimmal approach 

to add one to each frc^.n^ W^Z^'^''''''' ""T' '"P'^'^'^ 
values, and other apprLhes TL^aZ^ pseudocoum 
inSection5.6. *^P«ters, at greater length below 

Example: Parameters for an HMM based on Figure S3 

Let us assume that we use Lanlace'*: mi^ • 

2/27, and «„ fa) - ,m f™ .11 j '"''^ = ^^-^u,m = eu,(Fy = 

«M.M. =7/10'/: - ii /Li^r - iXi^ 

transitions from n^teh to match a^ LT i <f°"°^"»g column 1 there are six 
and no insertions). R^e74^Lwl^^^^^ '° ' '^'^'^ ^ «BB_human, 
in diagrammatic form ^ 



we are looking for global matches. 



5.4 Searching with profile HMMs 
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In practice, as for pairwise alignment, one of the local alignment methods may be 
more sensitive for finding distant matches. We discuss these in the next section. 

We have a choice of ways to score a match to a hidden Markov model. We 
can either use the Viterbi equations to give the most probable alignment jt* of a 
sequence x together with its probability P{x,n*\M), or the forward equations to 
calculate the full probability of x summed over all possible paths P{x\M). 

In either case, for practical purposes the result we want to consider when eval- 
uating potential matches is the log-odds ratio of the resulting probability to the 
probability ofx given our standard random model 

We therefore show here versions of the Viterbi and forward algorithms that are 
designed specifically for profile HMMs, and which result directly in the desired 
log-odds values. Note that changing to log-odds does not change the result; we 
could have subtracted the random model log score afterwards. However, it is 
pleaner and more efficient. Another practical reason for working in log-odds 
units is to avoid problems of underflow when working with raw probabiUties, as 
we discussed in Section 3.6. 



Viterbi equations 

Let VM(i) be the log-odds score of the best path matching subsequence xi, , to 
the submodel up to state j, ending with Xi being emitted by state My. Similarly 
Vj{i) is the score of the best path ending in xi being emitted by Ij, and VP{i) for 
the best path ending in state Dy . Then we can write 



V}(i) = 



log — 1- max 



, eijiXi) 

log hmax 



V;^_,0-l)-f-logai^_,M;, 

V;^,0-l)-^logao,_,M,; 
VM(/-i)+ioga^.,., 

V/(/-l)-|-loga,^,.. 

VjP(i-l)+logaoj,j; 



(5.1) 



VP(i) = 



max 



V)*!,(0 + lOgaM;_,D,, 

Vj_i(i)+logai._,o., - ^ 

^-l(')+10g«D;..D,. 

These are the general equations/ In a tyiiical case,' there is no emission score 
e^Oci) in the equation for V/(i) because we assume that the emission distribution 
ft^m the insert states I,- is the same as the background distribution, so the proba- 
mties cancel in the Iog-odds-form.^o,-theD.^ I and r-^-D transition terms 
giay not be present, as discussed above. ; a . v ■■ 
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=«.re» aU depending o„ p„;M„„ inl tlty 

Forward algorithm 
The recurrence equations for the forward algorithm are sinular to fh. Vt 

Fj(0 = I«g — +logK_,M,exp(/;M(,._j)^ 

^^^'^ = '°S^+'°gKi,exp(FM(/-i)) 
.o,., , +'^'>vexp(f;'(/-l))+a.,,exp(/rO(,_i))j. 
O(') = ^°gK-,D;exp(^;M _^ exp(/;?_,0)) 

+ «Dy-,D,exp(/;P,(0)J. 

a: . . Alternatives to log-odds scoring 
LU.) - log I A/). The LL score is strongly length depen^^^^ 
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Figures^ To the left the Ungth-nornializedLL score is shown as afimction 
of sequence length. The right plot shows the same for the log-odds score 

Jit is not good enough to use a simple threshold. It is better to use LL divided by 
^toe sequence length, but even that is not always perfect, because the dependence 
, Ween LL and sequence length is not linear (see example below) 

A way to get around this is to estimate an average score and a standard de- 
viauon as a function of length and then use die number of standard deviations 
each sequence ,s away from the average. This is called the Z-score, and is also 
illustrated m the example below. 

Example: ModeUing and searching for globins 

From 300 randomly picked globin sequences a profile HMM was estimated from 

m Chapter 6. A simple pseudocount regulariser was used. The estimation was 
.done several tunes and the model witii the highest overall LL score was picked 
,OVe used the default settings of the SAM package, version 1.2; Hughey 

'kH^T'^l ' ^"^^"^ °^ ^ ^ P™'«i^« (SWiss-PROT release 34- 
^'T .^^""''"^ ^'^'^^ ^^^^'^^ algorithm THe ll 

j^d log-odds scores were found for each sequence. For the null model we used 
fte ammo acid frequencies of the 300 sequences in the training set. In Figure sl 
, toe length-normalised scores are shown for all the globins in the training set all 
^^^^^^ and aU the rest of ^ proteins witi:^ ^ 
^ ammo acids.' The globin sequences are clearly separated from thVnon 

^lojjins apart from a few in the 'twilight zone.' , : . 

I^main difference between the two is in the variance <rf the scor^for non- 
gbms. which IS lower for the log-odds scoi., and therefore the separation is 
However, just cho^sji^a cut-off of zero for the log-oclds would miss a 



■ ■ i. 



..A few dubious globins and other strange seque;K=es were rem^edfom Oiese data^ 
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Figure 5.6 77. Z-score calculate, fi.m tl. ZZ scores (left) an, tHe lo.-oMs iri.Ht). 

lo calculate Z-scores, a smooth curve is fitted to th^ I T j ^ 

deviation is then estim^^f^H f^. «o u i . , ^ Liyy4JJ. A standard 

□ : 

Alignment 

. . ZT 'T^y- »s Pnmanly the subject of the next chapter, 

. a good approximation. ^r^ ^^7:T,^^ '^\^^^^i^^^ 
reasonable, as discussed in Chapter2r^ ! ^"'^"'r.^^'e chsttibution may be more 
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on multiple alignment methods, which covers alignment with profile HMMs at 
length. For now, we will just point out that the natural solution is to take the high- 
est scoring, or Viterbi, alignment. This is obtained by tracing back on the Viterbi 
variables exactly as with pairwise alignment. Beyond this, all the methods 
of Chapter 4 can be applied, to explore variants, and to assess the reUability of 
the alignment. 



5.5 Profile HMM variants for non-global alignments 

We have seen that there is a very close relationship between the Viterbi alignment 
of a sequence to a profile HMM and the global dynamic programming compar- 
ison between two sequences using affine gap penalties, which we described in 
Chapter 2. It is therefore possible to generalise all the variations of dynamic pro- 
granuning, such as those that find local, repeat and overlap matches, to use profile 
HMMs. 

However, we have developed probabilistic models much more fully since 
Chapter 2, and this time we want to take more care to ensure that the result of 
converting to a local algorithm remains a proper probabilistic model, i.e. that 
we assign each sequence a true probability so that the sum over all sequences 
Yij,P{x\M)=\. Our approach to doing this is to specify a new model for the 
complete sequence x, which incorporates the original profile HMM together with 
one or more copies of a sunple self-looping model that is used to account for the 
regions of unaUgned sequence. These behave very like the insert states that we 
added to the profile itself. We call ±em flanking model states, because they are 
used to model the flanking sequences to the actual profile match itself. 

The model for local (Smith-Waterman style) aUgnment is shown here: 




The flanking model states are shown as shaded diamonds. Notice that as well 
as specifying the emission probabilities of the new states, which will normally 
-of course be^^, we must specify a numberof new transition probabilities. 'The 
looping probabiUty on the flanking states should be close to 1, since they must 
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account for long stretches of sequence. Let us set these to (1 - ;;). Note also that 
we have made use of silent states, shown as shaded circles, as 'switching points' 
to reduce die total number of transitions. 

The next issue is how to set all the transition probabilities from the left flank- 
mg state to different start points in the model. One option is to give them equal 
probabilities, i)/L. Another is to assign more probability to starting at the begin- 
mng of the model. The default option in the hmmer package for profile HMMs 
[Eddy 1996] assigns probability r,/2 to the start of the profile, and ;j/(2(L - 1)) 
to the other positions, favouring matches that start at the beginning of the model. 

If all the probability is assigned to the first model state, then it forces this model 
to match only complete copies of the profile in the searched sequence, ensuring 
a type of 'overlap' match constraint. This can be appropriate when, for example 
the HMM represents a protein domain that you expect to find either present as a 
whole or absent. However, to allow for rare cases where the first residue might be 
missing, it may be wise in such cases to allow a direct transition from the flanking 
state mto a delete state, as shown here: 




B«gin 



















End 

















^.^^ ^.u, »y uii^cuiig wim me ffansmon connections and probabilities a 
wide vanety of different models can be produced, each potentiaUy useful in dif- 
ferem circumstances. A final example similar to the first model for local matches 




Which aUows repeat matches to subsections of the profile model, like the repeat 
algorithm variant in Chapter 2. . . 

ir Note that aU these variants of transition connectivity and probability assign-- 
ment affect not only the types of match that aie aUowed, but also the score. More 
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restrictive transition distributions will give higher match scores if a good match is 
found so are preferable if they can be designed to represent the types of co J 
matches that are expected. ^urrecc 

Exercises 

5. 1 Show that if the random model is the same as that described in Chapter 4 
(a succession of two states looping on themselves with probability (1 - 
n)), with the same as in the flanking models, the local alignment model 
gives update equations like those of equation (2.9). 
Explain the reasons for any differences. 



5.2 



5.6 More on estimation of probabilities 

As Fo^sed above, we now return to the subject of parameter estimation at 
greater length. Although our discussion for most of this section will be focusi 
on d.e emission probabilities, analogous methods can be used for the transition 
probabilities. The aim here is to introduce methods that can be used. A mo^de 
tailed mathematical discussion about the estimation of probabilites from sample 
counts IS given m Chapter 1 1 (p. 31 1). ^ 
^ The most straightforward approach to parameter estimation would be to give 

«T"™ *^ We will change notation 

shghtly from Uiat used before. Given observed frequencies c,„ of i^sidue a Z 
position ; of tiie alignment, maximum likelihood estimates forl.(a), the o j 
spending model parameters, are ' 



eMj(a) = 



(5.2) 



As we descnbed above, a clear problem with this is that if there are no observed 
exainples of a particular outcome then its probability is estimated as zero. IMs 
wtil frequentiy occur. For example, in tiie aligmnent of Figure 5.3 only v, I and 
Fa^epresentmthe first column. However, it is quite likely that other aLi acTd^ 
wm occur m that position amongst all the other globin sequences in biology. n,e 

' ! Pseudocount approach at greater lengtii, 

. tlien give some more complex alternatives. . . 



Simple pseudocounts 



1^ pseudocount method is to add a constant to aU the 

l^unts, which prevents the problem^th zero pxobabilities.rWhen the constant is — 
I one, as we used above in our example, this is caUed ^Laplace's rule'. A slight!^ 
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ofj corresponds to Aqa in the example above. One set might be relevant for ex- 
posed loop environments, one for buried small residue environments, etc. Given 
our counts cja we first estimate how likely each prior distribution k is (based on 
how well it fits the observed data), then combine their effects according to these 
posterior probabilities: 



ja' 



+<') 



where the P(k\Ci) are the posterior mixture coefficients. We calculate these by 
Bayes' rule, 

PkP(Cj\k) 



P{k\Cj) = 



where the pk are the prior probabilities of each mixture component, and P(Cj\k) 
is the probability of the data according to Dirichlet mixture k. The equation for 
P(Cj \k) has a frightening looking form, which is in fact fairly simple to calculate: 

where r(x) is the gamma function, a standard function over the reals related to the 
factorial function on the integers. For further details and an explanation of this 
equation, see Chapter 11, where we also describe how the mixture component 
distributions orj are obtained. 

Using this type of approach, it seems that good profile HMMs can be fit to 
alignments with as few as ten or twenty examples [Sjolander et al 1996]. 



Substitution matrix mixtures 

An alternative approach to using a mixture of Diiichlets is to adjust the pseudo- 
counts in a single Dirichlet formulation, using mformation from the observed 
counts and a substitution matrix. This is not a theoretically well-founded ap- 
proach, but it makes intuitive sense as a heuristic, combining features of the non- [ 
probabilistic profile methods and the Dirichlet pseudocount methods. | 
^. The first step is to convert the matrix entries s{a,b) into conditional proba- 
bilities P{b\a). If we assume that the substitution matrix entries are derived as 
log-odds ratios, as in Chapter 2, then s{a,b) = log(P(a,b)/qaqb\ which is the | 
same as log(P(fe|a)/P(fe)), so P(b\a) = qbe'^^'^K We can in fact derive P(b\a) | 
values from an arbitrary score matrix s(a,b) given background probabilities qa\ 
see below. 

Given conditional-probabilities P(b\a) we can generate pseudocpunts as fol- 
lows. Let fja be the maximum likelihood probabilities derived from the counts, 



' ^ ^ ^ Profile HMMsfor sequence families 

so fja = cj,/ Cj^.. Using these we set pseudocount values with 

b 

Where ^ is a positive constant comparable to the one we used with simple nseudo 

7m WeT * ''''' ^'^'^^ '"''-^ Henilc:;;^^'^^ 

pa": """"'"^ - (5-3) to obtain the mode! 

^M,(a) = -£i£±5if_ 

There is no obvious statistical interpretation for this type of pseudocount hnf 
the Idea . quue natural: amino acid / contributes to pseudocount y tpTp^^^^^^^^^ 
^.ts abundance m the column and the probability of its changing to Zo acM? 
TTie formu a interpolates between the treatment of pairwise alfgnmTrald the 
maximum hkehhood solution. Tl.e substitution maL term doTnate^ "Lt 
are small numbers of sequences fesoeciallv Jf 4 n ^ """ares ir tiiere 

(more precisely when the tottj number of co-nts C- » M ^ 

Deriving P(b\a)from an arbitrary matrix 

Even if a score matrix s{a,b) was not derived as a log-odds matrix as 1n„a . 
tain conditions are fuimied it is possible to findascfle^a^^^^^^^ 
Will behave correctly when interpreted as a log-odds matrix [^rhuU 99 1 T^^^^ 
conditions are that the matrix is negatively biased i e T TTT T I ! 
that it contains at least one positive entry. '"^"'^"'^^ ^ ^' 

What we want is a set of values nj for which 

5(a,6) = llog-^, 
^ ^aqb 

where r^, can be interpreted as the probability for the pair a,b This equation is 



One such value is A = 0, hi,, clearly ftis^Ton^rw^, The .wo condidons 
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we gave above turn out to be sufficient to ensure there is another, positive solution 
to this equation; see the exercises below. 

The resulting value of A. is called the natural scaling factor of the substitution 
matrix. This probabilistic interpretation of the substitution matrix leads to an en- 
tropy measure for the matrix of Ylab^ab ^og(rab/qa^b)^ which is a useful quantity 
for characterising and comparing substitution matrices [Altschul 1991]. 

Exercises 

5.3 Use the negative bias condition to show that f(k) is negative for small 
enough X, Hint: calculate /'(O), the derivative of f(k) at A = 0. 

5.4 Use the second condition, that there is at least one positive s{a,b), to 
show that f(X) becomes positive for large enough X. 

5.5 Finally, show that the second derivative of f(X) is positive, and from this 
and the results of the previous two exercises that there is one and only 
one positive value of X satisfying (5.4). 

Estimation based on an ancestor 

There is a more principled and direct way to use the information in substitution 
matrices for estimating the HMM probabilities than that described above. This 
approach does not use pseudocounts. Instead, it assumes that all the observed se- 
quences have been derived independently from a common ancestor, and generates 
an estimate of the residue present in a given position in that common ancestor (or 
rather a posterior probability distribution for what that residue was). From this 
we can estimate the probability of seeing each residue in a new descendant of the 
ancestor, different from those in the sample. 

Assume we have example sequences with residues Xj in column j of the 
alignment (we have adjusted our notation slightly; this Xj is not the jih residue 
in sequence jc* if there are gaps, but it is a convenient notation for what we need 
here). Once again, we need the conditional probabilities P(b\a) derived from the 
substitution matrix. Let the residue in the conunon ancestor be yj. Then we can 
use Bayes rule to calculate the posterior probability that yj ~ a 

P(yj = a laugnment) — 



(5.5) 



.Note that, we needed a prior distribution for residues at the common ancestor, 
^which we set to qa because that is our background probability for amino acids in 
;the absence of further information. 

/ij; We can now calculate the HMM emission probabilities as the predicted proba- 
bilities for a new sequence 



" ^j(a) =^ P(a|d V(>V = «'|aligMnent). 



(5.6) 
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One problem with this approach is that, as we noticed above, different columns 
vary widely in their degree of conservation. Indeed, that is one of the proper- 
ties that we wanted to exploit when using alignments to estimate profile HMMs 
However, using a single substitution matrix implies assuming a fixed degree of 
conservation. As we discussed in Chapter 2, matrices typically come in families 
varymg m their level of impUed conservation. Examples are the pam [Dayholf 
Schwartz & Orcutt 1978] and the BLOSUM [Henikoff & Henikoff 1992] series 
of matnces. We can therefore significantly improve the approach in (5.5) and 
(5.6) if we optimise over choice of matrix from a family. This way, a very con- 
served column might use a conservative matrix, such as PAM30, and a very varied 
column would use a divergent matrix, such as PAM500. 

How do we choose the optimal matrix? A natural approach is to maximise the 
likelihood of the observed data 



nx],...,x^\t) = Y,qaX\P{x)\a,t) 



(5.7) 



where / is the matrix family parameter (f for evolutionary time). It would also be 
possible to use a Bayesian approach here, proposing a prior distribution over t 
then combining this with (5.7) in Bayes' rule to obtain a posterior distribution for 
and summing over this in (5.6). However, that would require signficantly more 
computation. 

The maximum likelihood time-dependent approach is closely related to the 
'evolutionary weights' method in the prohle package [Gribskov & Veretnik 
1996]. However, that method estimates different evolutionary times t for each 
possible ancestral amino acid, and also adjusts the resulting weights with respect 
to a set of baseline probabiUties; for details see Gribskov & Veretnik [1996] 
There are also strong connections between the methods of this subsection and 
those discussed later in Chapter 8 when building phylogenetic trees using maxi- 
mum likelihood methods. 



Testing the pseudocount methods 



All the methods mentioned above have been tested in various ways. Direct tests 
m which profiles were constructed and used for searching, were carried out ex- 
tensively by Henikoff & Henikoff [1996]. The best method turned out to be the 
substitution matrix based method (5.6), with A = 5« as described above; the 
Dinchlet mixture regulariser came a reasonably close second. Other tests gave 
different results [Tatusov. Altschul & Koonin 1994; Kaiplus 1995], so it is not 
clear which method is best, and it is likely that this will depend on the application 
and the details of the mixture components or substitution matrix used. 

An interesting-method-was for testing various regularisers was given by 
Karplus [1995]. Instead of performing a huge number of database searches, he 



5.6 More on estimation of probabilities 



121 



laximise the 



asked the following question: How well can an amino acid distribution be ap- 
proximated from a small sample? Columns were extracted from a large set of 
deep alignments (the blocks database; Henikoff & Henikoff [1991]). Imagine 
we take a small sample of size n with counts Ca from a column with complete 
counts Ca- From the sample counts Ca we can estimate the frequencies eda) of 
other symbols that might occur in the same column, using one of the methods 
described above (we use a subscript c to remind ourselves that this estimation 
is dependent on the sample counts). We can now calculate the log likelihood of 
the other symbols that actually do occur, as Yla^^a ~ Ca)logec(a). Note that we 
do not score the counts that were used in the sample. We can now repeat this 
process for all samples of size n from all columns and sum all the resulting log 
likelihoods. 



LL= ^ J2 J2^Ca-Ca)logec{a). 

columns C samples c a 



(5.8) 



Karplus proposed that a good regulariser should maximise this value. Further- 
more, he pointed out that there is clearly an optimal strategy for such as process, 
which is to tabulate for each possible set of sample counts Ca the maximum like- 
lihood distribution emission distribution e^a). This is only practically possible 
to do explicitly up to n = 5. 

Karplus showed that several of the more complex regularisers described above 
resulted in estimators that were very close to optimal, in the sense that the differ- 
ence in LL values from optimal was very small at w = 5. Of course, ultimately 
we are interested in database searches, and it is not evident that the regulariser 
obtaining the lowest LL score will actually be best for searching. It is likely that 
the typical similarities in the source alignment database are not the same as the 
ones that we will be searching for with our HMM. 

L The actual averaging over all samples can only be done explicitly for sample 
sizes up to around n = 5, but that is also the most interesting regime, because 
it is for small sample sizes that regularisation is most crucial. For larger sample 
. sizes we would have to use some form of random sampling method to estimate 
the average. 

iScAs well as evaluating methods, Karplus' approach can also be used to set the 
^ free parameters in the various methods described above, for example the total 
Inumber of pseudocounts A to use in (5.3). For any value of A we can calculate 
Ithe average log likelihood from our database of colunms, either directly or by 
[ some sort of random sampling, and in fact we can also calculate the gradient of 
^ the relative entropy with respect to A. We can therefore find the value of A that 
I minimises this average relative entropy, using gradient descent methods [Press 
let ah 1992], or by other optimisation methods. Of course, in principle this can 
^be done for any sample size n, yielding parameters dependent on n. However, 
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because the averaging is difficult for n > 5, this technique has yet to prove its 
potential. 



5.7 Optimal model construction 

When we first discussed the parameterisation of profile HMMs, we pointed out 
that as well as estimating the probabihty parameters, it is necessary to decide 
which columns of the alignment should be assigned to insert states, and which to 
match states. We call this process model construction. At the time we proposed a 
simple heuristic, but we can do better than that. There is an efficient dynamic pro- 
gramming algorithm which can find the column assignments that maximise the 
posterior probability of the mode, at the same time as fitting optimal probability 
parameters. 

In the profile HMM formalism, it is assumed that an aligned column of sym- 
bols corresponds either to emissions from the same match state or to emissions 
from the same insert state. It therefore suffices to mark which columns come 
from match states to specify a profile HMM architecture and the state paths for 
all the sequences in the aUgnment, as shown in Figure 5.7. In a marked column, 
symbols are assigned to match states and gaps are assigned to delete states. In 
an unmarked column, symbols are assigned to insert states and gaps are ignored. 
State transition and symbol emission counts are obtained from the state paths, 
and these counts can be used to estimate probabihty parameters by one of the 
methods in the previous section. In passing, we note that this model estimation 
procedure implicitly assumes that the multiple alignment is correct, i.e. that die 
impUed state paths have probabihty one and all other state paths have probabihty 
zero, which is akin to a Viterbi assumption. The next chapter addresses issues of 
simultaneous alignment and model estimation. 

There are 2^ combinations of markings for an alignment of L columns, and 
hence 2^ different profile HMMs to choose from. There are at least three ways 
to determine the marking. In manual construction, the user marks alignment 
columns by hand. This is perhaps the simplest way to allow users to manually 
specify the model architecture to use for a given alignment. In heuristic construc- 
tion, a rule is used to decide whether a column should be marked. For instance, 
a colunm might be marked when the proportion of gap symbols in it is below a 
certain threshold. In MAP construction, a maximum a posteriori choice is deterr 
mined by dynamic programming. A description of this algorithm follows. :D 

MAP rnatch-insert 

TheMARconstniction algorithm recursively calculates a number 6), ~which is the 
log probability of the optimal model for ttie aUgnment up to and including column 



5. 7 Optimal model construction 



123 



;t to prove its 



e pointed out 
ary to decide 
and which to 
ve proposed a 
dynamic pro- 
maximise the 
al probability 

lumn of sym- 
to emissions 
olumns come 
;tate paths for 
irked column, 
lete states. In 
s are ignored, 
le state paths, 
by one of the 
lei estimation 
t, i.e. that the 
ve probability 
5sses issues of 

columns, and 
ist three ways 
rks alignment 
s to manually ^ 
istic construcT 
For instance, 
I it is below a J 
hoiceis deter?^ 
bllows. 

■ - • ■ % 'it^*^^ • 

. ■ -M - 

.which is the! 
luding column I 



(a) Multiple alignment: 

X X 

bat 
rat 
cat 
gnat 
goat 



... X 
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(c) Observed emission/transition counts 

model position 
0 12 3 



(b) Profile-HMM architecture: 





Figure 5.7 As an example of model construction from an alignment, a small 
DNA multiple alignment is given (a), with three columns marked above with 
xs. These three columns are assigned to positions 1-3 in the model ar- 
chitecture (b). The assignment of columns to model positions determines 
the symbol emission and state transition counts (c) from which probability 
parameters would he estimated. 

j, assuming that column j is marked. Sj is calculated from smaller subalignments 
endmg at a marked column / (i < j) by incrementing 5,- with the summed log 
probabibty of the transitions and emissions for the columns between / and J The 
relevant probabihty parameters are estimated 'on the fly' from the counts that 
are unpbed by marking columns i and j while leaving unmarked the intervening 
'columns (if any). 

Transition and emission counts for a section of aligmnent bounded by marked 
|polumns I and y are independent of how columns are marked before i and after y 
Pius making a dynamic programming recursion possible. Only marked columns 
considered m the recursion, because transition and emission counts for un- 
aark«I columns are not independent of the assigmnent of neighbouring columns- 
^gle msert state may account for more dian one column in the aUgnment ' 
^or mstance, let Tij be the summed log probabiUty of all the state transitions 
•etween marked columns / and j. We can determine 7J/ from the observed state 
i^siUon counts C;,, and the probabiMesr^y: ; ^ ,. - 
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Transition counts c^y are obtained from the partial state paths implied by mark- 
ing , and J. For instance, if in one sequence we see a gap in column / five 
residues in columns / + 1 to y - 1, and a residue in column j, we would count 
one delete-insert transition, four insert-insert transitions, and one insert-match 
fransition. The transition probabilities a,, are estimated from the c,, in the usual 
fashion, possibly including Dirichlet prior terms a,, (or indeed, any form of prior 
that is mdependent of the marking outside of y ): 



ajcy = 



Cxy 4- Uxy 



Y^yCxy+Clxy' 

Let Mj be the analogous log probabiUty contribution for match state symbol 
emissions in column j, and be the same for the insert state emissions for 

columns ^ + 1, . . . , y - 1 (for y - / > 1). We can now give the algorithm: 

Algorithm: MAP model construction 

Initialisation: 

So = 0, Ml+x = 0. 
Recurrence: for y = 1, . . . , L + 1 : 

cj = argmax 5,- + Ttj + Mj + ,_, + a.. 

0<i < j 

Traceback: From j = a^+i, while y > 0: 
Mark column y as a match column- 
J=aj. 

A profile HMM is then built from the marked alignment. The extra term k is 
a penalty used to favour models with fewer match states. In Bayesian terms k is 
the log of the prior probability of marking each column, implying a simple but 
adequate exponentially decreasing prior distribution over model lengths 

Jith some care in implementation, this algorithm is 0(L) in memory and 
U{L ) m time for an alignment of L columns. 



5.8 Weighting training sequences 

One issue that we have avoided completely so far is that of weighting sequences 
when estimating parameters. In a typical aligmnent, there are often some se- 
quences that are very closely related to each other. Intuitively, some of the in- 
formation from these sequences is shared, so we should not give them each the 
same mfluence in the estimation process as a single sequence that is more highly 
diverged from alLtheoto.._In. the extreme that two sequences are identic^ it 
makes sense that they should each get half the weight of other sequences, so that 



5.8 Weighting training sequences 



125 



lied by mark- 
)lumn I, five 
would count 
insert-match 
; in the usual 
form of prior 



ts = 3 



t,= 2 



4=2 



t. = J 



2 3 4 



L V I 



Figure 5.8 On the left, a tree of sequences with branch lengths. On the 
right, the corresponding * current* and ^voltage' values used in the 'Kirch- 
hoff's law' approach to sequence weighting (see text). 



the net effect is of having only one of them. Statistically, the problem is that typ- 
ically the examples we have do not constitute a good random sample from all the 
sequences that belong to the family; the assumption of independence is incorrect. 
To deal widi this sort of situation, there have been a large number of proposals for 
different ways to assign weights to sequences. In principle, any of these can be 
used in combination with any of the methods of the preceding sections on fitting 
model parameters and model construction. 



Simple weighting schemes derived from a tree 

Many weighting approaches are based on building a tree relating the sequences. 
Since sequences in a family are related by an evolutionary tree, a very natural ap- 
proach is to try to reconstruct this tree and use it when estimating the independent 
contribution of each of the observed sequences, downweighting sequences that 
have only recently diverged. We discuss phylogenetic tree construction at length 
later in Chapters 7 and 8, as well as in the next chapter on multiple sequence 
alignment. For our current purposes, the fine details of the method are probably 
not too important, and we will assume that we are given a tree connecting the 
sequences, with branch lengths indicating the relative degrees of divergence for 
each edge in the tree, 
fep One of the intuitively simplest weighting schemes [Thompson, Higgins & Gib- 
I son 1994b] can be expressed nicely as follows. We are given a tree made of a 
f conducting wire of constant thickness and apply a voltage V to the root. All the 
I leaves are set to zero potential and the currents flowing from them are measured 
^ and taken to be the weights. Clearly, the currents will be smaller in the highly di- 
! vided parts of the tree so these weights have the right qualitative properties. They 
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can be calculated by applying Kirchhoff 's laws. For instance, in the tree shown in 
Figure 5.8, let the current and voltage at node n be /„ and V„, respectively. Since 
constant factors do not affect the calculation, we can set the resistance equal to 
the edge-time. We then find Vs = 2h = 2/2, V6 = 2/i + 3(/i + I2) = 5/3, and 
V7 = 8/4 = 5/3 + 3(/i + /2 + /3). There are therefore three equations relating the 
four currents, and these give I\ : /2 : /3 : /4 = 20 : 20 : 32 : 47. 

Another attractively simple idea was proposed by Gerstein, Sonnhammer & 
Chothia [1994]. Their algorithm works up the tree from the leaves, incrementing 
the weights. Initially the weight of a sequence is set equal to the edge-time of the 
edge immediately above it. Now, suppose node n has been reached. The edge 
above n has edge- time tn, and this is shared out amongst the weights of all the 
sequences at the leaves below n, incrementing them by a fraction proportional to 
their current weight values. Formally, the increase Awi in a weight wi is given 
by 

Awi = tn = : . (5.9) 

Z^leaves k below n 

The same operation is carried out up to the root. 

This is clearly an easy and efficient algorithm. For instance, the weights in the 
tree of Figure 5.8 are computed as follows: Initially the weights are set to the 
edge lengths of the leafs, wi=W2 = 2, W3 = 5, and 11^4 = 8, At node 5 the edge 
length of 3 above node 5 is shared out equally to wi and W2, giving them 3/2 
each, so now wi = W2 = 2 + 3/2 = 3.5. At node 6 we find the edge of length 3 
above node 6 is shared out to nodes 1, 2 and 3 in the ratio 3.5 : 3.5 : 5, making 
u;i = 102 = 3.5 + 3 X 3.5/12, and 1^3 = 5 -I- 3 x 5/12. With W4 = 8, this gives 
w;i : u;2 : 1^3 : w;4 = 35 : 35 : 50 : 64. Even though these weights are close to those 
given by the Kirchhoff rule, the methods are in a sense opposed, for in a tree with 
two leaves and one edge longer than the other, the longer edge is down weighted 
by Kirchhoff and up weighted by (5.9). 



Root weights from Gaussian parameters 

One view of weights is that they should represent the influence of leaves on the 
root distribution. It is possible to make this idea precise, as Altschul, Carroll & 
Lipman [1989] showed. They built on the version of Felesenstein's 'pruning' 
algorithm which applies to continuous parameters [Felsenstein 1973]. Instead of 
discrete members of an alphabet we have a continuous real- valued variable, like 
the weight of an organism. In place of a substitution matrix we have a probability 
density that defines the probability of substituting one value, jc, of this variable 
by another, -ly.HA^simple-example.ofrSuch a'density is a Gaussian,-^where-the 
probability of x ^ y along an edge with time t is cxp(-(x — yy^/(2ah^). The 
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Figure 5.9 The tree described in the text when deriving Gaussian weights. 

pruning algorithm now proceeds exactly as for a finite alphabet, but with integrals 
replacing discrete sums [Felsenstein 1973].^ 

Felsenstein's algorithm yields a Gaussian distribution for the parameter in 
question at die root whose mean jx depends linearly on the values jc,- of the param- 
eters at die leaves, so ju- = WiXi, Altschul, Carroll & Lipman [1989] proposed 
that these wj, should be used as weights. They represent the influence of each leaf 
at the root. " 

Example: Altschul-Carroll-Llpman weights for a three-leaf tree 

To illustrate how the weights are derived, consider the simple three-leaf tree 
shown in Figure 5.9, where leaf / takes the value jc, . The probability distribu- 
tion at node 4 is given by 

P(x at node 4 | Li, L2) = iTie 2f, e 
where ^1 is a normalising constant. One can rewrite this as 

P(jcatnode4 |Li,L2) = ^ie 2/12 

where vi — ^2/(^1 + ^2)* vi = h /(ti + ti) and tx2 = h ti/Oi + ^2). If we were consid- 
ering only the two-leaf tree with root at node 4, the mean of the root distribution 
would be given by = uijci + 1^2^:2, and the weights would be vi and Contin- 
uing with our three-leaf tree; however, we find next that the distribution at node 5 

Historically, the continuous case came first, and Felsenstein defined the pruning algorithm 
for Gaussian distributions of real-valued parameters. In the cited paper he takes account of 
iCithe distribution of the parameters iat each leaf, e.g. the mean and variance of tiie weight of 
^^r^ organism. Puzzlingly, he also introduces covariances between values for different leaves. 

It is not cl^ how to cdcidate a cpvariance between, say, the weights ofcows and cats. For 
: ^f'proteins, having inultiple corresponding sites in an aligmnent would allow correlations to be 
't>5iConsidered in principle. * ::. ; ^ - . ... . . ' 
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is given by 

P(y&tnodt5\Li,L2,L3) = K2e 2*3 /e 2,,2 g ^iTdx 

where K2 is a normalising constant, and the integral is taken over all possible 
values of x at node 4 (and is the exact equivalent of the sum over all possible 
ancestral assignments of residues in the case of a discrete alphabet). This is a 
standard Gaussian integral, and boils down to the following 

(y-WlXl-w2X2-w^x■^)'^ 

P(y at node 5 | Li, L2, L3) = K^c 2*123 

where ^3 is a new normalising constant and /,23 = ^3(^1^2 + ^4(/i +^2)}/^, with 
^ = ht2 + + /4)(/, + tj). The mean of the distribution of 3;, i.e. of the root 
distribution, is given by 

/X -= WiXi + W2X2 + WiX'i 

with wi = t2t3/Q, W2 = hti/Q, and W3 = {tit2+t4(ti +t2)}/ii. These are there- 
fore the Altschul-Carroll-Lipman weights for a tree with three leaves. □ 



Voronoi weights 

There are also weighting schemes not based on trees. One approach is based on 
an image of the sequences from a family lying in 'sequence space'. In general, 
some wiU Ue in clusters and others will be widely separated. The philosophy of 
the Voronoi scheme [Sibbald & Argos 1990] is to assume that this unevenness 
represents effects of sampUng, including the 'sampling' performed by natural 
selection in favouring certain phyla. A more thoix)ugh trawl through all eligible 
sequences of the protein family, or perhaps a multitude of reruns of evolution, 
should produce a flat distribution within some region. To compensate for the 
gaps, we want to give sequences a weight proportional to the volume of empty 
space around them. 

If sequence space were two-dimensional, or even low-dimensional, we could 
use standard methods from computational geometry to divide up space into re- 
gions around each example point. The standard approach is to take Hnes joining 
neighbouring pairs of points and draw their perpendicular bisectors, extending 
them till they join up. This produces a partitioning into polygons (in two di- 
mensions) caUed a Voronoi diagram [Preparata & Shamos 1985], which has the 
property that the polygon around each point is the set of aU points closer to that 
point than any other. • - - ' ' 

Sequence space is of course a high-dimensional construct in which the Voronoi 
geometry is hard to jpicture or calculate. However, we can implement the under- 
lying principle^f it byTsampling s^ijuences randomly from seqiiencfe space and 
testing to see which of the family sequences each sequence lies closest to. The 
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trick is in the sampling. This is accomplished by choosing, at each position of 
the alignment, uniformly from those residues which occur at that position in any 
sequence. If we count such sample sequences closest to the i\h family member 
(dividing up the counts if there is a tie), then we can define the iih weight to be 



Maximum discrimination weights 

Another approach to weighting comes indirectly, from focusing initially on a re- 
formulation of the primary goal in building the model [Eddy, Mitchison & Durbin 
1995]. Rather than maximising the likelihood of sequences in the family, or even 
their posterior probability derived from Bayesian priors, we are normally inter- 
ested in making the correct decision on whether sequences are members of the 
family or not. We are therefore interested in the probability 



P(M ijc) = 



P(x\M)P(M) 



P(x\M)P{M) + P(x\R)(l - PiM)y 

where x is a sequence from the family, M is the model for the family that we 
are fitting, R is our alternative, random model for sequences not in the family, 
and P(M) is the prior probability of a new sequence belonging to the family. 
Given example training sequences x*, we would like to maximise the probability 
of classifying them all correctly, which is 

Z) = I~[P(M|A 



not n P(x^\M) as usual with maximum likelihood based approaches. We call D 
the discrimination of the model on the set of sequences x*. Maximising D will 
have the effect of emphasising performance on distant or difficult members of the 
family. Sequences that are easily classified will have P(M\x) values very close 
to one; changing parameters to increase their likelihood P(x\M) will have very 
.little effect on D. On the other hand, increasing the likelihood of sequences for 
:which P(M\x) is small can potentially have a big effect, 
r,. It tums out that the parameter values that maximise D can be shown to be the 
.ones that maximise a weighted version of the likelihood, where the weights are 
proportional to 1 - P(M\xi), i.e. the probability of misclassifying sequence 
|;jThis can be seen from the observation that if y = e^/(A: + e^), then 

31ogy 1 



3x 



K + c' 



If we set X = log (^gg), which is the log likelihood ratio for sequence jc, then 
^ ="P(M\x)rSo at a maxStnum of log D we will alsiobeatTa maximum of the 
weighted sum of log likelihood ratios, with weights 1 - P(M\Xi), and since the 
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random model is fixed this is equivalent to a maximum of the weighted log likeli- 
hood of the model M . The maximum discrimination criterion therefore amounts 
to another sequence weighting system. 

One difference from previous systems, however, is that these weights are de- 
fined in a somewhat circular fashion; they depend upon the model that is being 
fit. When using maximum discrimination weighting as a method, an iterative ap- 
proach must be used; an initial set of weights gives rise to a model, from which 
posterior probabilities P{M\x) can be calculated, giving rise to new weights, and 
hence a new model, and so on until convergence is achieved. This iterative re- 
estimation procedure is analogous to the versions of the EM algorithm used to fit 
HMM parameters to sets of unlabelled sequences (p. 64 and p. 323). 

Maximum discrimination training has a big advantage in that it is direcdy op- 
timising performance on die type of operation that the model will be used for, 
ensuring that the most effort is applied to recognising die most distant sequences. 
On die other hand, exactly the same point can lead to problems. If there is any 
training sequence that has been misclassified, then die distortion needed to give it 
a good score can damage performance for correct members of die class. To some 
extent, though, this same problem occurs with all weighting schemes: incorrectly 
assigned^sequences will be die most distant ones in any tree that gets built from 
the examples. 



Maximum entropy weights 

Finally, we describe two weighting methods based on the idea of trying to make 
the statistical spread of the model as broad as possible. 

Assume column i of a multiple alignment has kia residues of type a and a 
total of nti different types of residues. To make a distribution as uniform as 
possible from these counts by weighting each sequence, we can choose a weight 
for sequence k of l/irriik^^k). Maximum likelihood estimation will then yield 
a distribution pia = kia/{mikia) = l/m,-, i.e. all die residues appearing in die 
colunm will have the same probability. To illustrate die idea, suppose we have ten 
sequences with residue A at a site, and one sequence \yith a B, so die unweighted 
frequencies of A and B are Ca = jy, Cb = jj. The weights of die ten sequences 
are = UJ2 = . . . = tt^io = 1/(2 x 10) = 0.05, and «;u = 1/(2 x 1) = 0.5, which 
have the effect of making the overall weighting for each of A and B equal. 

The preceding paragraph only considered one column. Widi just one weight 
per sequence, it is of course not possible to make the distribution uniform for all 
columns in an alignment. However, by averaging over all columns, one may hope 
to obtain reasonable weights. That is, the weights are calculated as 
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and then normaUsed to sum to one. This weighting scheme was proposed by 
[Henikoff & Henikoff 1994] . 

Instead of averaging, there is another approach to combining the information 
from the different columns that has a simple theoretical justification. A standard 
measure of the 'uniformity ' of a distribution is the entropy (11.8), which is larger 
the more uniform the distribution is. Indeed, it is easy to see that the weights cho- 
sen above based on a single column maximise the entropy of the distribution pia 
for that column. An HMM defines a probabiUty distribution over sequences, and 
therefore a natural extension of the single column weighting to full sequences is 
to maximise the entropy of the complete HMM distribution [Krogh & Mitchison 
1995]. We will see that, perhaps surprisingly, this is closely related to maximum 
discrimination weighting. 

Let us consider all the sites in an alignment with no gaps. We then sum the 
entropies from each site, and choose the weights to maximise this sum; that is we 
maximise Hi(w^.) + >^Efc«^fc' where Hi(w.) = Pialog Pia, and pia is the 
weighted frequency of residue a at the /th site, computed as above. 

Suppose for instance that we have the sequences = AFA, = AAC, and 
jc^ = DAC. Giving them weights wu m and 1^3, respectively, the entropies at 
each site are 

Hi{w.) = -(Wi + U;2)log(w;i + W2) - W3l0gW3, 

Hsiw.) = -u;ilogit;i-(w^2 + i^3)log(uJ2 + t^3)- 

We assume that the weights sum to one, and therefore we have to use a Lagrange 
multiplier term kJ2k^k, when differentiating and finding the maximum of the 
entropy. Setting the derivatives of Hi{w.) + H2iw.) + H^{w.) -^XYlk^k to zero 
gives {wi + u)2)w\ = {wi + W2){W2 + w^f = wz{w2 + UJ3)^ wMch implies wx = 
= 0.5, W2 = 0. This makes the frequencies in each column equal, which was 
our goal. If it seems odd to give a sequence zero weight, note that the residue at 
each site in is always present in one of the other two sequences. Intuitively, x'^ 
Ues 'between' jc^ and x^ (in fact, it would be a possible ancestral sequence of 
and x'^ in an evolutionary reconstruction based on parsimony; see Chapter 7). 
! Another way to view to the result of this example is that if we set the model 
"probabiUties to be the weighted counts frequencies, as a weighted maximum like- 
; .lihood procedure would, the resulting model assigns an equal probability to all 
' of the original sequences, x^ jc^ and x^. This seems very reasonable, accord- 
:ing to the view that all the example sequences should be treated as equally good 
-members of the family for which we are building the model. In fact, Krogh & 
■ Mitchison [1995] show that the maximum entropy proce dure as signs weights to 
^ the examplesequences so that some subset of the sequences (perhaps all of them) 
-have non-zero weight and equal probabilities under the resulting model, or they 
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have a higher probability, in which case they have zero weight. The former can 
be thought of as boundary points for the region of sequence space occupied by 
the whole sequence set, while the latter are internal points. 

Furthermore, empirical tests indicate that the maximum entropy weights are 
optimal in the sense that they maximise the minimum score assigned to any 
of the example sequences [Krogh & Mitchison 1995]. This is an absolute ver- 
sion of the criterion specified in the previous section on maximum discrimination 
weights; rather than simply weighting the weakest match most strongly, all the 
parameter-fitting effort is applied to increasing its score, until it reaches that of 
the other non-zero-weighted sequences. Although satisfying an attractive goal, 
maximum entropy weightmg suffers from the same problems as maximum dis- 
crimination: if a sequence is an outlier that should not be a fuU member of the 
family, the method will force it in, possibly at a substantial cost in performance 
on all other sequences. In addition, the rejection of all information from some of 
the sequences may seem intuitively undesirable. 



Exercise 



5.6 



^Compute the weights for the following sequence set, using each of the 
weighting methods described above except Voronoi weights (which re- 
quires random sampling of sequences): AGAA, CCTC, AGTC. 



5.9 Further reading 

PSSM methods were introduced during the 1980s for finding new members of 
sequence families, although the matrix values were not always obtained using an 
expUcit probability-based derivation. They are also known by other names, such 
as weight matrices [Staden 1988]. More recent papers using related methods 
include those by Stormo [1990]; Henikoff & Henikoff [1994]; Tatusov, Altschul 
&Koonin[1994]. 

The non-probabilistic versions of profiles already have a long history, and 
many variants of the profile method have been suggested and tested. Thomp- 
son, Higgins & Gibson [1994b] and Luthy, Xenarios & Bucher [1994] report an 
improvement when the sequences are weighted using one of the blosum matri- 
ces [Henikoff & Henikoff 1992] instead of a pam matrix. In Thompson, Higgins 
& Gibson [1994b] the treatment of gaps is also improved, _ ; q 

Several ways have been suggested for incorporating structural information into 
profiles. In Luthy, McLachlan & Eisenberg [1991] substitution matrices were es- 
timated for six different structural environments: ^ the three secondary structure 
elements or-helix, )3-sheet, and 'other' combined with an outside/inside classifi- 
cation, which was based on the exposure of an ammo acid to solvent. Other vari- 
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ations of structural profiles can be found in Bowie, Luthy & Eisenberg [1991]; 
^ilmanns & Eisenberg [1993]. 

Early on, profile HMMs were adopted by Baldi et al [1994], who used them to 
model globins, immunoglobulins and kinases. In this work a different estimation 
method was also introduced, which was based on gradient descent, see also Baldi 
& Chauvin [1994]. The same basic structure of profile HMMs has since been 
used in several different areas. A Ubrary of HMMs for all the big protein families 
has been established under die name of pfam [Sonnhammer, Eddy & Durbin 
1997]. The library of regular expressions called prosite [Bairoch, Bucher & 
Hofmann 1997] is bemg extended to something essentially like profile HMMs 
[Bucher et al 1996]. Profile HMMs also have several uses for DNA. For instance 
they can be used to find DNA repeat family members in large-scale genomic 
sequence. 
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Background on probability 



To make our book more self-contained, we have included a last chapter that gath- 
ers together the probabilistic ideas and methods we use. The various sections of 
this chapter are fairly independent, and can be dipped into as the reader wishes. 
Some parts are more mathematically technical than the rest of the book. 

11.1 Probability distributions 

We introduce here various probability distributions used throughout the book. 
When the outcomes we wish to assign probabUities to belong to a finite set X, 
a probability distribution is simply an assignment of a probability to each 
outcome x in X. For instance, the probability distribution of outcomes of rolling 
a fair die would be Px = l/6 for the six outcomes a; = 1, . . . , 6. 

If we have a continuous variable a:, like the weight of an object, then the prob- 
ability that that variable takes a specific value, e.g. that the weight is exactly 1 
pound, is zero. But the probability that x takes a value in some interval, PixQ < 
X < xi) say, can be well defined and positive. As the width of the internal tends 
to zero, we may be able to write P{x - 8x/2 <x<x+ Sx/2) = f(x)Sx, where 
/ (x) is a function called a probability density, or just density. The probability 
of an interval can then be derived by integration: P(xo <x<xi) = f' f(x)dx 
A density must satisfy f{x) > 0, for all x, an4 /_~ /(;cW;c = 1. BuTnote that 
we can have f{x) > 1. For instance, the density fix) = 10 for 0 < jc < 0.1 and 
f{x) = 0 elsewhere is well defined. 



The binomial distribution 

The first distribution wexonsider is perhaps the simplest and most familiai^'^the 
binomial distribution. It is defined on a finite set consisting of all the/iwssM^ 
results of tries of an experiment with a binary outcome, 'O' or 'iC If p is the 
probabiHty of getting a '1' and 1 - p that of getting a '0', the probabiUty that k 
out of the AT tries yield a T is 
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where (^) denotes the number of ways of choosing k objects from A^, that is 
N \/{{N — k)\k\\ and the factorial function is defined for non-negative integers as 
n! = n(n"l)-l,andO! = l. 

The mean m and variance of any distribution P are defined by m = YlkP{k) 
and — — ni)^P{k). The positive square root of the variance, a, is called 
the standard deviation. For the binomial distribution 

and 

We can show (Exercise 11.1) that m = Np and — Np(l — /?). 
Exercise 

11.1 Calculate the mean and variance of the binomial distribution. (Hint: To 
find m, differentiate the binomial expansion (p + q)^ = J2o (jt) P^Q^"^ 
with respect to p and set 9 = 1 — p. For the variance, carry out two 
differentiations with respect to p.) 



The Gaussian distribution 

Consider next what happens as we let —> 00. Both the mean and the variance 
increase linearly with N, but we can rescale to give fixed mean and variance, 
defining the new variable uby u = {k — m)/a = (k — Np)/^Np{\ — p). It is a 
classic result [Keeping 1995] that, in the limit of a large number of events, a bi- 
nomial distribution becomes a Gaussian (see Figure 11.1), and with the rescaling 
the density is 

/(M) = -^exp(-MV2). (11.2) 

This can be regarded as a special case of the central limit theorem, which states 
that the distribution of a sum of N independent random variables, normalised to 
the same mean and variance, tends to a Gaussian as AT -> 00. If a single variable 
takes values '0' or T with probabilities 1 — p and /?, respectively, the distribution 
of the suin of N copies oif this is 'P(fc) = P{X\ + ,,. + Xf^ < k), and is precisely 
the biiionaial considered above. ' • 

The multinomial distribution 

Hie generalisation of the binomial distribution to the case where the experiments 
have K independent outcomes with probabilities 0, , / = 1, , . . , AT, is the multino- 
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Figure 11.1 The limit for large N of a binomial tends to a Gaussian. In 
this case N — 40 and p— 1/4 in(ILI), 
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mial distribution. The probability of getting n,- occurrences of outcome / is given 
by 

K 

P(n\e) = M-\n)Y[e^^. (11.3) 
/=i 

Here we condition the probability on the parameters 9 of the distribution, which 
is a natural thing to do in a Bayesian framework, because then the parameters are 
themselves random variables. In a classical statistics framework the probability 
of n could, for instance, have been denoted by Po(n), The normalising constant 
only depends on the total number of outcomes observed, J2k fixed n^ 

it is 



M(n) = 



(11.4) 



For /sT = 2 the multinomial distribution reduces to the binomial distribution. 
Example: Rolling a die 

The outcome of rolling a die times is described by a multinomial. The prob- 
abilities of each of the six outcomes are called 0i, . . . ,^6. For a fair die where 
5i = . , . = ^6 = 1/6 the probability of rolling it a dozen times and getting each 
outcome twice is ^ ' 
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The Dirichlet distribution 

In Bayesian statistics we need distributions over probability parameters to use 
as prior distributions. A natural choice for a density over probabilities is the 
Dirichlet distribution: 

K K 

= Z-\a)We^^~'KY.^i - 1). (11.5) 
1=1 1=1 

Here a = ai,...,a^, with a/ > 0, are constants specifying the Dirichlet distri- 
bution, and the Oi satisfy 0 < ^, < 1 and sum to 1, this being indicated by the 
delta function term SQZi 6i — 1). The algebraic expression for the 0, is the same 
as for a multinomial distribution. Instead of normalising over the numbers of 
occurrences of outcomes, however, we normalise over all possible values of the 
Of, To put this another way, the multinomial is a distribution over its exponents 
All , whereas the Dirichlet is a distribution over the numbers Oi that are exponen- 
tiated. The two distributions are said to be conjugate distributions [Casella & 
Berger 1990], and their close formal relationship leads to a harmonious interplay 
in many estimation problems. 

The normalising factor Z for the Dirichlet defined in (1 1.5) can be expressed 
in terms of the gamma function: [Berger 1985] 

= / Uor'siZ^i - i)de = (11.6) 

The gamma function is a generalisation of the factorial function to real values. 
For integers r(n) = (n — 1)!. For any positive real number x, 

r(x + l) = jcr(jc). (11.7) 

It can be shown that the mean of the Dirichlet distribution is equal to the nor- 
malised parameters, i.e. the mean of 6i is of, / ak. For instance, the three distri- 
butions shown in Figure 1 1.2 all have the same mean (1/8, 1 /4, 5/8), even though 
the ofs for the top right figure are 10 times larger than those for the top left. Note 
that larger of s produce a tighter distribution. Note also that when some a, < 1 the 
distribution is peaked at zero for the corresponding 0,, as shown in the bottorii 
left figure. 

For two variables (K = 2) the Dirichlet distribution reduces to the more widely 
known beta distribution, and the normalising constant is the beta fimction. . , 

Example: The dice factory 

Consider again our example from Chapters 1 and 3 of a probabilistic model of a 
possibly loaded die with probability p arameters 9 =^i, . . . ,06. Sampling proba^., 
bility vectors 6 from a Dirichlet parameterised by a = ai, . . . is like a 'dice 
factory' that produces different dice with different 0 [MacKay & Peto 1995]. 
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Figure 11.2 Examples of three-dimensional Dirichlet distributions, each 
shown by sampling 10000 points, i,e, by choosing points 0 with probability 
3)(e\a), The values of a used are (1,2,5) in the top left figure, (10,20,50) 
top right and (0.1,0.2,0.5) bottom left. The probabilities 6 are displayed as 
the slice through 3D space (^1,^,^3) where J^Oi = 1; see the bottom right 
figure, A point (^1,^,^3) is mapped to ((O2 — 0\)/^,6^) in the plane. 

Suppose dice factory A has all six set to 10, and dice factory B has all a, set 
to 2. On average, both factories produce fair dice; the average of Oi is \ in both 
cases. But if we find a loaded die with 06 = 0.5, 0i = . , . = ©5 = 0.1, it is much 
more likely to have been produced by dice factory B : 

= ^i^(0.1)^C''-"(0.5r-i=0.119, 

m\ccB) ^^(0.1)^2-«(0.5)2-i = 199.6. 

The factory with the higher a parameters produces a tighter distribution in 
favour of fair dice. The sum is inversely proportional to the variance of the 
Dirichlet. (Don't be alarmed by the Dirichlet density having a value of 199.6; 



recall that the values of continuous probability densities at any point may be 
greater than one.) 
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Figure 11 J Gamma distributions g(x,a,fi)for a = B=lO a = B = 60 
and a = 2.0,^ = 1.0. ' ' 



A factory that produced almost perfectly fair dice would have very high but 
equal a, . A factory that produced variably unreliable dice that are still fair on 
average-would have low but equal a,- . q 

The gamma distribution 
The gamma distribution g(x,a,fi) is given by 



g(x,a,fi) = 



r(a) ' 



and is defined for 0 < x,a,p < oo. Its mean is a/fi and variance a/fi\ is 
simply a scale parameter. 

The gamma distribution is conjugate to the Poisson, /(n) = e'^p" //j, which 
gives the probabiUty of seeing n events over some interval, when there is a prob- 
ability p of an individual event occurring in that interval. Since the number of 
events in an interval is a rate, the gamma distribution is appropriate for modelling 
probabiLties of rates, just as the Dirichlet is appropriate as a prior for emission 
probabilities when its conjugate, the multinomial, is used to assign probabilities 
to counts (p. 319). The gamma distribution has been used to model the rate of 
evolution at different sites in DNA sequences (p. 215). 



The extreme vaiiie distoibulion 



-•rela 



Suppose we.take samples from the density ;g(x). The probabiUty that.t@i. 
largest amongst them is less than jc is G(a:)^, where G{x) = g{M)duJ^£h^ 
density for the largest value of the set of N is given by differentkting this with 
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^ = 6.0 



sry high but 
still fair on 



respect to x, giving Ng{x)G{x)^'\ The limit for large N of Ng(x)G(x)^-^ is 
called the extreme value density (EVD) for g{x). It has a wide variety of practical 
uses, from modelling the breaking-point of a chain (which is determined by the 
weakest link), to assessing the significance of the maximum score from a set of 
alignments (see Chapter 2). 

Let us compute the EVD when g(x) is the exponential density g(x) = ae""^. 
Integrating gives G{x) = 1 - e"*'''. Choosing y so that e~"^ = and writing 
z = jc — J, we find 



Ng(x)G(x) 



iVae-"^(l-e-"0^"^ 



:ae-"^(l-e-"ViV) 



ae exp(-e for N ^ oo, 

where we used the well-known limit (1 - X/N)^ -> e~^ for N ^ oo} The cu- 
mulative probability (the probability that the extreme value is < x) is exp(-e""^), 
and is called a Gumbel distribution [Gumbel 1958]. The above density often gives 
a good approximation to the distribution of extreme values for moderate values 
of N. With the exponential density. Figure 11.4 shows that the maximum of a 
sample of size 10 gives a close approximation to the EVD. 

It is a surprising fact that the Gumbel distribution is the EVD for a variety 
of underlying densities g{x)\ it holds when ^(jc) is a Gaussian too, for instance. 
More generally, an EVD must have the form exp(-/(aA^x + ^a^)), where ayv and 
Z?// are constants depending on N and f{x) is either an exponential e"^ or \x\~^ 
for some positive constant X (see Waterman [1995] for a more precise statement 
of this theorem). 



^ a//32. ^ is 

which 
2re is a probp 
le number of 
or modelling'- 
for emission*! 
probabilitiesj 
el the rate'bfl 



• u i 

iHoility-thatiifiy 
g(u)duJjlb% 
ting this wim] 



11.2 Entropy 

Some of the tenninology used in the book is borrowed from information theory 
(see e.g. Cover & Thomas [1991]). Information theory has strong connections to 
probabilistic modelling. 

An entropy is a measure of the average uncertainty of an outcome. Given a 
random variable X with probabilities P{xi) for discrete set of K events xu,..^XKy 
the Shannon entropy is defined by 

H{X)^-Y^P{Xi)\ogP{Xi). V' (11-8) 

I 

In this definitibn^ P(jc,)i6g'P(j:,) is taken to, be zero if P(xi) = 0. Normally 
we assume that log is the natural logarithm' (sonietimes' written In). However, 
it is common to use the logarithm base 2 (called log2), in which case the unit of 

^ There is one delicate point in the aboye.argument |We have to take care that e~"^,, cannot 
r grow rapidly with N, and so invalidate the limiV.(i - cy^/N)^^^ expi~c:J^)y, i:o be 
■ ' more precise, one has to show that the probability of large values of e~"^ according to the 
distribution A^g(jc)G(x)^~^ beconies vanishingly small. , : * = ' ^ 
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Figure 11.4 Approximations to the extreme value distribution obtained by 
sampling N points from the distribution e^"" onO<x< oo, and then taking 
the^tnaximum. From the top left to bottom right, = 1,2, 10, 100. 



entropy is a *bit'. All logarithms are proportional, e.g. log2(;c) = loge(jc)/loge(2), 
so theoretically it does not matter which logarithm is used. Often we talk about 
the entropy of the probability distribution P, H{P\ instead of H{X), 

The entropy is maximised when all the P(jc/) are equal {P{xi) = 1 /K) and we 
are maxunally uncertain about the outcome of a random sample. The maximum 
is the - K^^?>Y^ ^ ^e are certain of the outcome of a sample from 

the distribution, i.e. P{xk) = 1 for one k and the other P(jc/) = 0, the entropy is 
zero. 

Entropy also arises as the expected score of the sequences generated by certain 
probabilistic models when the score is defined to be the log probability. Suppose, 
for instance, that the probability of residue a in some position in a sequence is 
Pfl. Then there is a probability pa of score logp^, and the expected score is 
Ha Pa log namely the negative entropy. The same is true (see Exercise 11. 2) 
when the model defines the probabilities at a set of independent sites. . >f 

If you are told the outcome of an event, the uncertainty is reduced from H to 
zero, because you have gdned information. Therefore entropy is often equat^ 
witii information. This can be confusing; it leads to the quite counterintuitive 
view tiiat tiie more random something is (the higher the entropy), the more iii- 
formation it has; It is not confusing if^we^nk of information as a difference in 
entropy. More genially; m/orvwari^^^ 

a reduction in uncertainty after some 'message' is received; hence, the difference 
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between the entropy before and the entropy after the message: 

7(X)= //before -//after. (11.9) 

The uncertainty is not always reduced to zero; there may be noise on the com- 
munications channel, for instance, and we may remain somewhat uncertain of 
the outcome, in which case Hafter is positive and the information is less than the 
original entropy. 

In information theory it is often assumed that the probability distributions are 
known exactly. In many applications, however, the true distributions are not 
known, and therefore entropies are calculated from the frequencies of events 
rather than the true distributions; see Examples below. 



Example: Entropy of random DNA 

If each symbol (A, C, G, or T) of a DNA sequence occurs equiprobably (pa = 
1 /4) then the entropy per DNA symbol is — Yla Pa 1^82 Pa = 2 bits. 

We can think of the entropy as the number of binary yes/no questions needed 
to discover the outcome. For example, for random DNA, we need two questions: 
'purine or pyrimidine?' followed by 'A or G?' if the answer is 'purine', and *C 
or T?' otherwise. □ 



Example: Information content of a conserved position 

Information content can be used to measure the degree of conservation at a site 
in a DNA or protein sequence alignment. Say we expect a DNA sequence to be 
random (pa = 0.25; //before = 2 bits), but we observe that a particular position in 
a number of related sequences is always an A or a G with = 0.7 and pc = 0.3. 
Thus //after = -0.71og20.7 -O.SlogjO.S = 0.88 bits. The information content of 
this position is said to be 2 — 0.88 = 1.12 bits. The more conserved the position, 
the higher the information content. 

Notice, however, that the information content can be negative if the observed 
distribution has a higher entropy (is more 'random') than expected. For find- 
ing unusual patterns it is therefore better to measure the difference between the 
distributions by the relative entropy described below. □ | 

Exercise ,^ , ...... r.. : M 

11.2 Assume a model, in which pi(a) is the probability of amino acid a oc- 

curring in the I th position of a sequence of length /. The aniino acids i | 

are considered independent. What is the probability P(x) of a particular | 

sequence x = jci , . . . , jc/? Show that the average of the log of ttie proba; 1 1 

bility is the negative entropy J2 ^(^) l^S ^(^)» where the sum is over all i 

possiblesequencesjc of length/. . |; 
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Figure 11^ Proof that the relative entropy (11.10) is always positive or 
zero // P(xi) = Q(xi) for all i. From this graph is can be seen that 
logU) <x-\ with equality only if x = 1. It follows that -H(P\\0) = 
E! P(xj)logiQ(xi)/P(xO) < E, P(xm(xi)/P(xi)- 1) = 0, with eqLl- 
ity holding only if for each i, Q(xi) = P(xi). 



Relative entropy and mutual information 
We return to the definition of different types of entropy. For two distributions 
P and Q the relative entropy (also known as the KuUback-Leibler 'distance') is 
defined by 



wiiG)=x:^(^,)iog^. 



(11.10) 



Information content and relative entropy are the same if the g is a uniform 'back- 
groimd distribution' (G(x,) = i) that represents a completely naive initial state 
tor flbefore- The two terms are sometimes used mterchangeably. 

Relative entropy has the property that it is always greater than or equal to zero 
It IS easy to show that H(P\\Q)>0 with equality if and only if P(;c,) = Q(xi) for 
aUi (see Figure 11.5). It is often useful to think of the relative entropy H(P\\Q) 
as a distance between the probability distributions P and Q. However it is not 
symmetric, H(P\IQ) ^ H(Q\\P), and it does not fulfil the formal requirements 
of a proper mathematical distance measure. 

•nie relative entropy often arises as the expected score in models where the 
score IS defined as the log-odds. i.e. P(data|M)/P(data|/?), where M is tbc'i^ 
del, and /? is a nuU model. If is the probability of residue a in some position'in 
a sequence according to M, and its probability according to R, then the score 
for residue a is log(pM, and the expected score is Pa log(p. /q.), which is 
the relative entropy. ' - ' ■■-<■ .• 

-" Another important entropyTaeasuie is the mw^^^ ^iWrandoiri 
vanables X and y are independent if PiX, Y) = P(X)P(Y). It is interesting to 
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know how independent they are, and that can be measured by the relative entropy 
*distance' between the distributions P{X, Y) and P{X)P{Y), 



M (X; Y)^y >;,.) log 



(11.11) 



where the possible values for X and Y are {x/} and {yj}. This is the mutual 
information. M(X\Y) can be interpreted as the amount of information that we 
acquire about outcome X when we are told outcome Y. 

The mutual information is maximal when X and Y always covary. If for in- 
stance all pairs except AT, TA, GC, and CG have probabihty zero for two po- 
sitions / and j in some aligned DNA sequences, there is maximal covariation. 
For this situation we will always have P(xi,yj) = = P(yj) or P(xi,yj) = 0, 
and therefore M = -J2. P(Xi)log P(xil This is the entropy of X (or Y), so it is 
maximal for a uniform distribution, and the maximum is log A' (assuming that X 
and Y have the same number, K, of possible outcomes). The maximum mutual 
information for DNA sequences is therefore log24 = 2 bits. 

In Figure 10.6 the mutual information (calculated from frequencies) between 
every pair of columns in an RNA alignment is shown. 

Example: Acceptor sites 

Relative entropy is useful for finding unusual patterns in biological sequences. To 
illustrate tiiis we extracted 757 acceptor sites from a database with human genes. 
The acceptor site is the splice site at the 3' end of the intron where the intron 
is spliced out to make the messenger RNA. The last two bases of the intron are 
almost always AG, and in this dataset they all are. We only took acceptor sites of 
introns occurring between two codons, i.e. not splicing in the middle of a codon. 
We extracted 30 bases upstream of the splice site and 20 bases downstream. In 
Figure 11. 6 you see a small arbitrary sample of the sequences. 

At each position i the frequency pi(a) of the four nucleotides was found, and 
the relative entropy J2aPi(^)^^B2[Pi(^)/Qa] calculated, where qa is the overall 
distribution of the four nucleotides in the sequences. We plot this in Figure 1 1.6. 
At the AG consensus the relative entropy is very high (equal to -log2(^A) and 
-log2(^G) respectively). There is an interesting structure in the relative entropy 
upstream of the site with a minimum just two bases before the AG. There is a 
weak periodic signal (barely visible) of the relative entropy in the coding region, 
which is due to;the different base composition in the three reading frames. See 
Brunak, Engelbrecht & Kmidsen [1991] and Hebsgaard et al [1996] for more 
discussion of information in splice sites, and Schneider & Stephens [1990] for 
colourful ways of displaying various entropy measures. ' 

To see if the neighbouring positions are independent, the mutual information 
between the columns was calculated. For two neigbouring colunms (say / and 
i + 1) the frequency of pairs Pi(a,b) was found by counting how many times 
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Example sequences 

CTTCTCAAATAACTGTGCCTCTCCCTCCAGATTCTCAACCTAACAACTGA 
CTGCTCACCGACGAACGACATTTTCCACAGGAGCCGACCTGCCTACAGAC 
GGTTCCCTCTTGGCTTCCATGTCCTGACAGGTGGATGAAGACTACATCCA 
ACTAACTCTCCTCCTCGTGTGTCTCCCCAGCCCGTGTCCCAGCCCACCCA 
TTGATAACATGACATTTTCCTTTTCTACAGAATGAAACAGTAGAAGTCAT 
TCTACCGTCCCTTTCCCACACACTCTGCAGAAGGTGGTGTTGTCTTCTGG 
CTTTTTTCTCTCCTATGTGCATCCCCCCAGGAGCTGGCTGAATATGAATA 
GCTAATAGCTTGCTTATTTATTTAACATAGGGCTTCCGTTACAAGATGAG 
AATTTAGTTTTATTCCCATGTGACCTGCAGGTAAATGAAGAAGGCAGTGA 
ACTCTGCTCACTGTCACTTTGCTCCCACAGCGTCCGCTCTGCAATGGCAG 
ACCTCCTAACGTTGTTGGGTTTCTTTGCAGAACTTTGCTGCCCAGATGGC 
GTAAACCCCTCATTTTCTGTTCCGATGCAGGGCCCCATGGGACCTCGAGG 
AGAAGTGACATTTTTCCTATATGTTGACAGGGTGGTGACTTCACACGCCA 
CTGGTGTGAGGACCTGCCTCTCTTTTCAAGGGTGAACCTGGTATTGCTGG 
ACCTTGGGCACTGTGTTCCTTTGTTTCTAGCACTGGCAGATCCCCCTGAG 
TTTTGTTATGCAATTATTGTTTTCTTACAGGGCCCTCTACTAAAGAAGGA 
GCATCACCTGTCAGCTCCCTGTGTCCACAGGCTCTGCAGCGGCTCAGGGA 



Figure 11,6 Plots of relative entropy and mutual information for acceptor 
sites. Below is shown a sample of the sequences. Note the peak in relative 
entropy and dip in mutual information at the conserved AG. 



a occurred in column / and b occurred in column / + 1. From this the mutual 
information Ea^P/(«.fr)log2[pi(a,6)/A(«)p,-+i(fe)] was calculated, and is also 
plotted in Figure 11.6. 

Notice that the mutual information is zero at the AG consensus: knowing that 
the first is A conveys no information about the next position, because it is al- 
ways a G. The mutual information around the acceptor site is much less than the 
maximum of 2 bits, but it is non-zero, and it shows that there are correlations be- 
tween neighbouring positions. This is true in most DNA. A clear periodic pattern 
is seen for the coding region, showing that the nucleotides are dependent in the 
three reading frames. □ 
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Exercises 

1 1.3 Prove the above assertion about the equivalence of information content 
and relative entropy when q is uniform. 

11.4 Show that M{X\ Y) = M{Y\X). 

11.5 Show that M{X\Y) = H{X) + H{Y)~ H{Y where H{Y,X) is the 
entropy of the joint distribution P(X, Y). 



11.3 Inference 

Probabilistic models are the main focus of this book. A model can be anything 
from a simple distribution to a complex stochastic granmiar with many implicit 
probability distributions. Once the type of model is chosen, the parameters of the 
model have to be inferred from data. For instance, we may model the outcome of 
rolling a die with a multinomial distribution. Suppose the number of observations 
yielding / is n, (/ = 1, . . . ,6). We do not know if it is a fair die, so we need to 
estimate the parameters of the multinomial distribution, i.e. &e probability Oi of 
getting / In a throw of the die. Here, we consider the different strategies that 
might be used for inference in general. For more background, see Ripley [1996] 
and MacKay [1992]. 



Maximum likelihood 

Let us suppose, then, that we wish to infer parameters 6 = [6i] for a model M 
from a set of data D. The most obvious strategy is to maximise P{D\6,M) over 
all possible 6. This is called the maximum likelihood criterion. Formally we write 



9^ = argmaxP(Z)|6>,M). 

e 



(11.12) 



Generally speaking, when we treat P(jc|y) as a function of x we refer to it as a 
probability; when we treat it as a function of y we call it a likelihood. Note that a 
likelihood is not a probability distribution or density, but simply a function of the 
variable 

Maximum likelihood has some desirable properties. For instance, it is consis- 
tent, in the sense that die parameter yalue used to generate the dataset will 
also, in the. limit of a large amount of data, be'^the value that maximises the 
likelihood. To see this, suppose there are K observable outcomes coi,,,,,cok 
of the model M (e.g. the 4" possible assignments of nucleotides at a site in an 
"^ghed set of sequences). - Then the frequency nf / J|]ni of occurrences of (Oi 
will tend to P((Oi\Oo,M) as the amount of data increases (see Exercise 11.6). 
Hence the log likelihood for parameter 0, given by J^i(ni / J^nk) log P(eoi \6, M) 
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tends to X!,- P{(Oi\6Q,M)\ogP{(Oi\9,M)^ The positivity of relative entropy im- 
plies that P{(Oi\eoM)\ogP{o)i\eoM) > E/ P{(Oi\6oM)\ogP{coi\eM\ for 
all 9. Thus the likelihood is maximised by 

A drawback of maximum likelihood is that it can give poor results when the 
data are scanty; we would be wiser then to rely on more prior knowledge. Con- 
sider the dice example and assume we we want to estimate the multinomial pa- 
rameters from, say, three different rolls of the dice. It is shown on p. 319 that the 
maximum likelihood estimate of 9i is «, /^rik, i.e. it is 0 for at least three of the 
parameters. This is obviously a bad estimate for most dice, and we would like a 
way to incorporate the prior knowledge that we expect all the parameters to be 
quite close to 1/6. 

Exercise 

11.6 The weak law of large numbers says that the mean of a sample of size 
N differs from the true mean by an amount d or more with probabil- 
ity a'^/{N4\ where is the variance of the distribution. Show that 
this implies that nt/J^^k tends to P(a)i) as E«Jt ^ oo, where n,- is the 
frequency of occurrence of o)/. 

The posterior probability distribution 

The way to introduce prior knowledge is to use Bayes' theorem. Suppose there is 
a probability distribution over the parameters 9, Conditioning throughout on M 
gives the following version of Bayes' theorem; 



Pi9\D,M) = 



P(D\9,M)P(9\M) 



(11.13) 



P(D\M) 

The prior P(9 \ M) has to be chosen in some reasonable manner, and that is the 
art of Bayesian estimation. This freedom to choose a prior has made Bayesian 
statistics controversial at times, but we believe it is a very convenient framework 
for incorporatmg prior (biological) knowledge into statistical estimation. 

P{9\D,M) is the posterior probability for the parameters, given the data and 
the model. The posterior can be used for inference in various ways. We can 
sample from it (see Section 11.4), and thereby locate regions of high probability 
for the model parameters. In Section 8.4 we show how this can be done for 
probabilistic models of phylogeny. If we want a specific set of paranieter valiies 
for the model, we might be guided by analogy with ML and choose the maximum 
a posteriori probability (MAP) estimate. 



9^ = aigmaxP(D\9,M)P(0\My 



(11^14) 



WW 



Note that we ignore the data prior P{D\M), because it does not depend on the 
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parameters 9 and thus the maximum point is independent of it. Another 
possibiHty is to take the posterior mean estimator (PME), which chooses the 
average of all parameter sets weighted by the posterior: 



e 



PME 



=/ 



ep(e\n)de. 



(11.15) 



The integral is over all valid probability vectors, i.e. all diose that sum to one. 
In the following we will derive the PME for a multinomial distribution widi a 
certain prior. 

Both MAP and PME estimators are considered a little suspicious, because a 
non-linear transformation of the parameters usually changes the result. In tech- 
nical terms they are not equivariant [Ripley 1996]. To see what's going on, we 
need to consider the effects of change of variables on densities. 



Change of variables 

Given a density /(jc), suppose there is a change of variable x = Then 
we can define a density g{y) by g{y) = f(<l>iy))\(pXy)[ The derivative of 0, 
<pXy)^ is there because the interval Sx corresponds to an interval Sy^\y) under 
the transform 0, so the amount of the / density that is swept out under 0 is 
proportional to this derivative; taking the derivative's absolute value ensures that 
the density is positive. This definition produces a correctly normalised density 
because / g{y)dy ^ f /(0(y)) \<l>\y)\dy = / f(x)dx = 1, / being a density. We 
write the transformation rule formally as 



8(y) = n(l>(y))W(y)\- 



(11.16) 



The function /((piy)) clearly has the same maximum as fix). When we multiply 
by \<p\y)\, however, this maxunum may shift (see Exercise 11,7). Now, the pos- 
terior P(9\D,M) is a density, so the peak chosen by MAP can likewise change 
under a transformation. A similar argument shows that the PME can change un- 
der a coordinate transformation. 

In contrast, the likelihood P(D\e,M) does not transform like a density; it is 
simply a function of 6 and a change of coordinates leaves the peak unchanged, 
just as the peak of f((f>(y)) remains the same as that of fix) [Edwards 1992]. 



not depend omtlie 



Exercise 

11.7 Let fix) = 2(1 -Jc) be a density on [0, 1]. Show how this transforms to 
a density Wy under x = y^. Show that die peak and the PME of the 
density both shift under this transformation. ; ^ * 
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11.4 Sampling 

Given probabilities P{xi) defined on the members jc, of a finite set X, to sample 
from this set means to pick elements jc/ randomly with probability P{xi). 

The basic practical tool for sampling is a function derived from a computer's 
pseudo-random number generator (i.e. the function called rand[], or something 
similar), that picks numbers randomly from the interval [0, 1] with the uniform 
density. Let us call this function rand[0, 1]. Using it, we can choose elements 
Xi widi frequency We set y = rand[0, 1], and then choose our element jc, 

by finding that / for which P(jci) + . . . + P(jc,-i) < rand[0, 1] < P{xx) + , . . + 
^fe-i) + P(^i). Clearly, the probabiUty of rand[] lying in this range is P{xi\ so 
Xi is picked with the correct probability. 

It is actually not easy to produce random numbers with a computer. The stan- 
dard function for pseudo-random numbers is usually very primitive, and will not 
be good enough for some applications. For example, the standard rand[] func- 
tion on many UNIX computers returns an integer between 0 and 2^^ — 1, and 
one would expect to obtain *random' bits (0 or 1) with this function by taking the 
value returned modulo 2. However, this gives a sequence where 0 and 1 alternate, 
which |s clearly not random at all. On most systems there are other (and better) 
functions to choose from. See for instance Press et al [1992] for a discussion of 
random number generators. 

Sampling by transformation from a uniform distribution 

The concept of sampling applies also to densities: Given a density /, to sample 
from it is to pick elements x from the space on which / is defined so that the 
probability of picking a point in an arbitrarily small region SR round the point x 
is f(x)SR, Sampling of densities can be accomplished by using pseudo-random 
numbers that sample from the uniform density on [0, 1], and applying a change 
of variables that changes the density appropriately. 

The theory of this goes as follows: Suppose we are given a density /(jc), and a 
map x = (piy). From (11. 16) we know that ^(>^) = f(ji>{y)Wiyy If / is uniform, 
we have g{y) = 4>Xy\ so ^ can be obtained by integration, 4>{y) = // g{u)du\ 
where b is some suitable lower bound. However, we want to pick points in i 
usmg a good pseudo-random number generator, and then map them to 3?'; -For 
this, we require the inverse function to 0, namely y = <^~^(x). . ^ t ifi 

Suppose for instance that we want to sample from a Gaussian. We define the 
cumulative Gaussian map 4>iy) = /^^e""^/2/-v^rfM, and let y = 0-^(x). We 
could make a look-up table to evaluate the inverse cumulative Gaussian fimctibn, 
but this is rather clumsy^ arid some other approach may be knore convenien t (elg . 
Exercise 11. 10).^ '^:c::-- -/c^K ■ ■ -^r'-n.; ^ y \ 

The transformation method also applies more generally to functions of K vari- 
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ables, but then (11.16) must be replaced by 

where J{<t>) is the Jacobian, whose (/, j)-th entry is dipi/dyj [Feller 1971]. 
Exercises 

11.8 Show that the function g{y) = a^Xy^'^Ha^ + y^f is a density on 0 < 
3; < oo. Show that picking x uniformly from (0, 1) and mapping x to 
y = a(Yzj)^^^ samples from giy), 

1 1.9 Define a mapping <t> from the variables (x, y) to (w, u;) by = wuj, y = 
(1 - m)u;. Show that /(0) = u;, where / is the Jacobean. 

11.10 (Calculus needed!) Suppose we pick two random numbers jc and y in 
the range [0, 1] and map (jc,y) to the sample point cos(27rx)log(l/y^). 
Prove that this samples correctly from a Gaussian. This is called the 
Box-MuUer method [Press et al 1992]. 

Sampling from a Dirichlet by rejection 

We consider now the problem of sampUng from a Duichlet, which illustrates 
some Tmportant principles. Suppose first that we can sample from the gamma 
distribution g{x,a, 1) 

g(jc,a,l) = e-^jc«-Vr(a) 

for 0 < x < 00 (see p. 304). If we take sampled values x\ and X2 from two ganmia 
distributions with parameters ori and of2, respectively, then we can define a pair 
(m,v) with M + 1; = 1, by setting u = xi/(xi +X2), v = X2/{xi +X2); equivalently, 
we can set jci = uw, jc2 = (1 - u)w and integrate over w, Usmg (1 1.17) and the 
results of Exercise 11.9, the distribution D(u,v) of paks (u,v) is given by 

f^S(u + v- l)e-""'(Mw)"^-^e-^"^(t;it;)^^-^tt;Jt/; 



DM = 



r(ai)r(«2) 
Jo 

r(ai+a2) 
r(ai)r(a2) 



r(ai)r(a2) 




(11.18) 



t^where 3)(m, v|ai,a2) is the Dirichlet distribution with parameters ai,of2- In otheir 
words, to sample firom a Dirichlet distribution of two variables (a beta distribu- 
tion), we sample from two gamma distributions, whose exponents are those of the 
components of the Dirichlet in question, and then normalise the sampled num- 
;bers to give probabilities: This elegant result extends to Dirichlets of any number 
of variables (Exercise 11.11). 
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Figure 11.7 Rejection sampling: We wish to sample from a gamma distri- 
bution ^(jc,a,l) (continuous line). It is possible to sample from the func- 
tion f given by (11.19) ('+' signs), whose value always exceeds that of 
the gamma distribution. Having sampled a point x from /, this point is 
accepted with a probability equal to the ratio of the gamma distribution 
and f at that point, i.e. with probability g(jc,a, l)//(jc). The left figure 
shows f with a = 5, k — % the right with or = 5, >. — 1. 



We can sample from a Dirichlet, therefore, if we know how to sample from a 
gamma distribution. Now we can show (Exercise 11.12) that g{x,a, 1) < /(jc), 
where ^ 

and X = ^Tjol— 1. It follows that, if rand[0, 1] truly samples uniformly between 
0 and 1, then P(rand[0, 1] < g(jc,a, l)//(x)) = g(jc,a, 1)//(jc). Thus if we first 
sample from the distribution /, picking a point x with probability f(x\ and 
accept JC if rand[0, 1] < g{x^a, \)lf{x\ then 

P{x) = /(jc)P(rand[0, 1] < l)//(x)) = g(x,a, 1). 

So this two-stage procedure enables us to sample from the gamma distribution. 
It remains only to show how to sample from /. But Exercise 11.8 shows that 
choosing u from [0, 1] by rand[0, 1] and defining x = a(M/l — m)*/^ is equivalent 
to sampling from /. For more details of the material in this section, and also for 
the appropriate procedure in the case where 0 < or < 1, see Law & Kelton [1991]. 
Figure 11 .2 was generated using this method. 

This is an example of rejection sampling, OiQ distribution g being obtained 
by /trinuning down' from the distribution /, which is analytically tractable and 
always larger than g. This only works well if f(x) is a good approximation to 
g(j:,a,l); if it is not, the rejection rate will be high. The function / gives a 
good approximation to g(x,a, 1) over the range where both functions are large, 
Le. where they will be' most frequently sampled~fr6m.^The choice of A is in 
fact optimal for this purpose. For instance, with a = 5 and A = V2a 1 = 3, 
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only 14% of points are rejected (Figure 11.7, left figure), whereas with X = 1 
(Figure 1 1.7, right figure), 65% are rejected. 

Exercises 

11.11 Show that (11.18) can be extended to the case of K gamma distribu- 
tions, i.e. that sampling from g(jc,a„ 1), for / = 1, . . . , /C, then averaging, 
is equivalent to sampling from the Dirichlet D(0i,...,0A:|ai,...,of/(:). 
(Hint: Show that the Jacobian of the map jc, = m,uj, for i <K-\, and 
Xf^ =(1- J]M|)u;isequaltoit;^"^) 

11.12 Prove that g(x,Q!, 1) < /(x), for all a: and a > 1 and 1 > X ^ V2a- 1, 
where f{x) is defined by (11.19). What happens when X > V2a-1? 

Sampling with the Metropolis algorithm 

We often want to sample from a probabilistic model, where the analytic methods 
that underUe the transformation method or rejection sampling are not available. 
One possible approach then is to use a Markov chain defined on the space X of 
outcomes [Neal 1996]. We assume here that X is finite, although the ideas carry 
over to continuous variables and densities. 

Given a point x, a chain specifies a probabiUty r(y\x) for the transition x y 
to a point y. If we can sample from the distribution r(y\x), i.e. given x can pick 
a y with probability r(y\x), then we can generate a sequence [yi] where each yt 
is picked by sampling from the distribution r(y |yi-i). 

Suppose now that we can find a t satisfying 



P(x)r(y\x)^Piy)r{x\y). 



(11.20) 



This is called the condition of detailed balance. It turns out that detailed balance 
implies 



— lim dyi =x) = P(x), 



(11.21) 



for aU points jc, where C(y/ = x) is the number of times yi=xm the sequence of 
length A^. We can therefore approximate P as closely as we like by taking long 
enough sequences of {y,} sampled using r. This statement needs to be quaUfied: 
Clearly, the chain needs to be able to reach every point y from any other point x; 
in other words, there must be a sequence of transitions that can go firom x to y, 
for any jc and y. ^ ' ' ' : 

t If we have a transition process t that satisfies (1 1.20), therefore, the sequences 
it generates wiU sample P correctly. But can we find such a process? A method 
that achieves this is the Metropolis algorithm. It has two parts: ' ;_J^ _ 

(1) A syirimetricpropo^^ point y 

with probability F(y\x). Symmetry means that F(y|x) = F{x\y\ - '-^ ' 
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(2) An acceptance mechanism that accepts the proposed 3^ with probability 
min(l,P(y)/P(x)). In other words, a point y with larger posterior probability 
than the current x is always accepted, and one with lower probability is accepted 
randomly with probabiUty P(y)/P{x). 

To see that this satisfies (1 1.20) note that, for jc 7^ y, 

P(xMy\x) = P(x)F(y\x)rmn{hP(y)/P(x)) 
= F(y|x)min(P(jc),P(y)) 
= F(x|y)min(P(y),/>(A:)) 
= P(y)r(x\y). 

Here we used the symmetry of the proposal mechanism to replace F(y\x) in the 
second line by F(jc|y) in the third. 

Gibbs sampling 

When we have a probabilistic model of many variables, it may often be possible 
to sample from the distribution obtained by keeping all variables fixed except one, 
i.e. die conditional distribution. Gibbs sampling exploits this idea. It works by 
choosing points from the conditional distribution P(xi |jci, . , . jc/-|_i, . . . , jca^) 
for each /, cycling repeatedly through i = l,,,,,N. 

To show that this samples correctly from P, it is enough to prove detailed 
balance. This means that 

P(x\^ . . . ,Xn)P(Xi |xi, . . . ,A;,*_i,;Ci+i, . . . , a:„) 

= PiXi, . . . ,Xi-i,Xi,Xi^i, . , , ,Xn)P(Xi\Xij , , , ,Xi-uXi^i, , , . ,Xn)^ 

But we can rewrite diis as 

P(Xu ' ' ' ,Xn)PiXu . . . . . . JCn)/ PiXi,, . . ,Xi-\,Xi+i,. .,,Xn) 

= P(xi,...,A:i_i,i,*,x,.(_i,..,,x„)P(xi,...,A:„)/P(A:i,...,x,_i,j:,+i,...,Xn), 

which makes the equality obvious. Provided that the process doesn't get stuck in 
some subset of the parameter space, i.e. provided it is ergodic, Gibbs sampling 
will inevitably converge to /*, . 

The kind of situation in which Gibbs sampling can get stuck is where there are 
two pieces of density which do not overlap along any of the coordinate directions, 
e.g. in the 2D case where half the density lies in the region [0, 1] x [0, 1] and the 
other half in the region {2, 3] x [2, 3]. Note that if there were even a small overlap, 
e.g. if half the density were uniform on [0, 1] x [0, 1] and the other half uniform 
—on [0.99, 1.99] x [0.99, 1.99],-then^ampling^ould pass between the two regions, 
albeit making the transition between regions quite infrequently. 
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Exercise 

11.13 What is the expected number of samples within one region, in the pre- 
ceding example, before a cross-over occurs into the other? 
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11«5 Estimation of probabilities from counts 

Above we used the example of rolling a die. We needed to estimate the param- 
eters of a multinomial from data: rolls of the die. The same abstract situation 
occurs frequently in sequence analysis, but with the number of rolls n,- with out- 
come I now meaning something different. For instance, n, might be the number 
of times amino acid i occurs in a colunm of a multiple alignment. 

Assume that the observations can be expressed as counts n, for outcome i 
(i — 1,..., A') and we want to estimate the probabilities ft for the underlying 
multinomial distribution. If we have plenty of data, it is natural to use the ob- 
served frequencies, ft = ni/N, as the estimated probabilities. Here N = Yli^i- 
This is the maximum likelihood solution, 9f^. The proof that this is so goes as 
follows. 

We want to show that P{n\9^) > P(n\e) for any 6 ^ 6^, This is equiva- 
lent to showing that log[P(n\e^)/ Pin\9)] > 0, if we only consider probability 
parameters yielding a non-zero probability. Using equations (11.3) and the defi- 
nition of 0^^, this becomes 



P(n\e^) 

P(n\e) 



log 



= log- 



= 5]]rtflog 



6 



ML 



ft 



ft^ 



ML 



= ;v^ft^^log-^>0. 

The last inequality follows from the fact that the relative entropy (11.10) is 
always positive except when the two distributions are identical. This concludes 
the proof. 

If data are scarce, it is not so clear what is the best estimate. If, for instance, we 
only have a total of two counts both on the same residue, the maximum likelihood 
estimate would give zero probability to all other residues. In this case, we would 
like to assign some probability to the other residues and not rely entirely on so 
few observations. Since there are no more observations, these probabilities must 
be determined from prior knowledge. This can be done via Bayesian statistics, 
and we will now derive the posterior mean estimator for 6. 
-As the prior we choose the EDirichlet"distribution:(lh5)- with parameters a. 
We can then calculate the posterior (11,13) for the multinomial distribution with 
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observations n: 

P{n\6)3:){e\(x) 
P{n) 

For ease of notation, we have dropped the conditioning on the model M as 
compared to (11.13), and consider all probabilities implicitly conditioned on the 
model. Inserting the multinomial distribution (11.3) for P{n\d) and the expres-. 
sion (1 1.5) for <©(0|a) yields 

In the last step Hi^r'^' "^ recognised as being proportional to the Dirich- 
let distribution with parameters n +a. Here n+a means the set of parameters 
[ni+ai] (vector addition). Fortunately we do not have to get involved with 
gamma functions in order to finish the calculation, because we know that both 
P{d\n) and S){6\n + a) are properly normalised probability distributions over 9. 
This means that all the prefactors must cancel and 

F(0|n) = 5)(0|n+a). (11.22) 

We see that the posterior is itself a Dirichlet distribution like the prior, but of 
course with different parameters. The observation that the above prefactor is one 
gives us a Uttle corollary, which will be useful later: 

Z{n+a) 

P(n) = . (11.23) 

^ ^ Z(a)M(n) 

Now, we only need to perform an integral in order to find the posterior mean 
estimator. From the definition (11.15), 

ef^ = j eiD{e\n+oi)de = z-\n+a) j 6iY[el'^'''de, (ii.24) 

We can bring 9i inside the product giving 0^'^' as the /th term. Then we see that 
the integral is exactly of the form (11.6). We can therefore write 

PME ^ Z{n+a + &i) 
Z{n+<x) 

= 1i±fl^ (11.25) 

where A — Yli and 5, is a vector whose ith component is one and all its other 
components zero. Here we have used the property (1 1:7) of the gamma function, 
i.e. r(x+ l) = Jcr(jc); this allows us to cancel all terms except ni+ai in the 
numerator and A/^ + A in the denominator. 

i This result should be compared to^eML estimate If we think of the as" 
as extra observations added to the real ones, this is precisely the ML estimate! 
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The as are like pseudocounts added to the real counts. This makes the Dirichlet 
regulariser very intuitive, and we can in a sense forget all about Bayesian statis- 
tics and think in terms of pseudocounts. It is fairly obvious how to use these 
pseudocounts: if it is known a priori that a certain residue, say number / , is very 
conmion, we should give it a high pseudocount a, , and if residue j is generally 
rare, we should give it a low pseudocount. 

It is important to note the self-regulating property of the pseudocount regu- 
lariser. If there are many observations, i.e. the ns are much larger than the as, 
then the estimate is essentially equal to the ML estimate. On the other hand, if 
there are very few observations, the regulariser would dominate and give an esti- 
mate close to the normalised as, 0i ~ a,- /A. So typically we would choose the as 
so that they are equal to the overall distribution of residues after normalisation. 

Mixtures of Dirichlets 

It is not easy to express all the prior knowledge about proteins in a single Dirichlet 
distribution; to achieve that it is natural to use several different Dirichlet distribu- 
tions. We might for instance have a Dirichlet well suited to exposed amino acids, 
one for buried ones and so forth. In statistical terms this can be expressed as a 
mixture distribution. Assume we have m Dirichlet distributions characterised by 
parameter vectors a^ . . . ,a'", A mixture prior expresses the idea that any proba- 
bility vector 0 belongs to one of the components of the mixture <2)(0|a^) with a 
probability ^fc. Formally: 



P(0|a\...,a'") = ^^,5)(0|a*), 



(11.26) 



where qk are called the mixture coefficients. The mixture coefficients have to 
be positive and sum to one in order for the mixture to be a proper probability 
distribution. (Mixtures can be formed from any types of distributions in this 
way.) Whereas this probabiUty was called P(9) in the previous section, we are 
here conditioning on the as, which was impUcit before. This turns out to be 
convenient, because we can then use probabilities like P(a^\n) below. We can 
then also identify qj, as the prior probabiUty q^ = Pia^) of each of the mixture 
coefficients. 

For a given mixture, i.e. for fixed a parameters and mixture coefficients, it is 
straightforward to calculate the posterior probabilities using the results from the 
previous section. From the definition of conditional probabilities, we have 



Pi9\n) = ^i>(0|a*,n)P(a*|n) 
k 

, - = J^P(a*|n)i)(0|n + a*), 



where we used the expression for the posterior (11.22). To compute the term 
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P{a^\n), note that by Bayes' theorem we have 



E/<7//'(n|a')' 

using qk = P(a*). The probability P(/i|q:*) is given by (11.23) (remember that 
P{n) in the previous section was implicitly conditioned on the Dirichlet parame- 
ters, so it is P(n|a*)), and we get 



9tZ(n+a*)/Z(a*) 



P(a*|«)= \ (11.27) 

The final integration to obtain can be done using the results (1 1.24) and 
(11.25) from the previous section, and yields 

9r'^ = TP(a^\n)'^. (11.28) 



The estimate using a mixture of Dirichlets is similar to using a single one: you 
just average the estimate based on each component of the mixture. However, the 
weight (11.27) with which they are averaged in the mixture is new. This weight 
is a little hard to understand intuitively, but it gives a high weight to mixture 
components with a high probability for the sample. 



Estimating the prior 

For more details of the ideas presented in the preceding section, see Brown et 
al [1993] and Sjolander et al [1996]. These authors used Dirichlet mixtures to 
model the distribution of column counts. They obtained the prior by estimating 
the mixture components and the mixture coefficients from a large dataset, i.e. a 
large set of count vectors. 

The estimation is done as follows: The mixture defines a probability for each 
count vector in the database, . . . , n^. 



P(/^^|a^...,of'";^l,...,^,) = j P(n'\9)P(e\a\...,a'^;qu...,qm)d0, (11.29) 

If the count vectors are considered independent, the total likelihood of the 
mixture is ' '"'"^ ' , ^ 

■ "'^^ ' M ' - • ■ '--^'-^'^ 

P(data|niixture) = PJi>(n'|of\...,a'";9i q^). (11,30) 

1=1 

This probability can be maximised by gradient descent or some other method of 
continuous optimisation. " " 

At this point the reader is probably asking *Why use ML estimation instead of 
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these wonderful Bayesian approaches I just learned?' To do this you just need 
a prior on the parameters of the first level of priors. You can put priors on prior 
parameters forever. At some point you have to settle for a prior you invented or 
one estimated by ML or some other non-Bayesian method. 
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11.6 The EM algorithm 

The expectation maximisation (EM) algorithm is a general algorithm for ML es- 
timation with 'missing data' [Dempster, Lakd & Rubin 1977]. The Baum-Welch 
algorithm for estimating hidden Markov model probabilities is a special case of 
the EM algoridim. For HMMs the missing data are the unknown states, since we 
only know the observations and not the sequence of states producing them. 

Assume some statistical model is determined by parameters 9, The observed 
quantities are called jc, and the probability of x is determined by some missing 
data y . For the HMM that we will treat below, 6 is the set of all model parameters 
a and e, and y represents the path through the model. The aim is to find the model 
that maximises the log likelihood 

logPix\e) = log^P{x,y\e). 

y 

Here and in the following x means all the observations whether there is one or 
more sequences. To return to the notation with all sequences shown explicitly 
requires an extra sum over sequences in all the following formulae. 

Assume now that we have a valid model, 0'. We want to estimate a new and 
better model, 6>'+^ Using P(x,y\e) = Piy\x,e)Pix\9), we can write the log 
likelihood as 

log P{x\e) = log P(x,y\d) - log P(y\xM 
Multiplying this with P(y |jc, ^0 and summing over y yields 

logP(;c!0) = ^P(>'|;c,6iOlogP(jc,j|0)-^P(y|x,0')logP(y|x,6f). 

y y 

The first term on the right we will call Q(0 \6% 

y 

We want log P{x\9) to be larger than \ogP{x\e% so the difference should be 
positive. Using the two equations above we can write the difference 



.logP(;cl0)"logP(jc|0O = 



Piy\x,e'X 
P(y\x.e) ' 
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The last term is the relative entropy (1 1.10) of Piy\x,e') relative to Piy\x,e\ so 
it is always non-negative, so 

log P{x\e) - log P{x\6') > ame') - Q{e' \e') (i 1.32) 

with equatilty only if ^ = 9', or if P{y\x,e') = P{y\x,e) for some other (9 # 6^^ 
Choosing 

e^+i=argmaxe(^|^') (1133) 

d 

will always make the difference positive and thus the Ukelihood of the new model 
larger than the likelihood of 9'. Of course, if a maximum has akeady been 
reached, 6*'+^ = 6\ and the likelihood will not change. 

The function Q in (1 1.31) is an average of log P(x,y|6>) over the distribution 
of y obtained with the current set of parameters OK This can often be expressed 
analytically as a function of 6 in which the constants are expectation values in 
the old model This will be more concrete when we go through the derivation for 
HMMs shortly. The EM algorithm is usually formulated like this: 

Algorithm: Expectation maximisation 
E-step: Calculate the Q function (11.31). 

M-step: Maximise Q{e\9^) with respect to 9. < 

We saw above that the likelihood increases in each iteration, so the procedure 
will always reach a local (or maybe global) maximum asymptotically as t 
oo. For many models, such as HMMs, both of these steps can be carried out 
analytically. If the second step cannot be carried out exactly, we can use some 
numerical optimisation technique to maximise Q, In fact, it is not necessary to 
maximise it; it is enough to make Q{9'''^\9') larger than Q{9'\9% Algorithms 
that increase Q - without necessarily maximising it - are called generalised EM 
(GEM) algorithms [Dempster, Laird & Rubin 1977]. Other generalisations of the 
EM idea can be found in Meng & Rubin [1992] and Neal & Hinton [1993]. 

EM explanation of the Baum-Welch algorithm 

For the HMM we shall now sketch the derivation of the EM steps which forms 
the Baum-Welch algorithm described m Chapter 3, p. 63. In this case we want to 
maximise the likelihood 

so the 'missing data' are die state paths n. Then Q (11.31) is given by 

■ (11.3-4)" 
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For a given path each parameter in the model will appear some number of times 
in P{x,n\0) given by the product (3.6). If it is a transition probability we will 
call this number Akiijt) and for the emission probabilities Ek{b,7t\ i.e. Ek{b,n) 
is the number of times character b is observed in state k for path jr (it depends on 
the observation sequence, which we do not show explicitly). Then we can write 
(3.6) as 

M MM 
k=i b k^Ol=l 

where the first product is over all characters b in the alphabet. After taking the 
logarithm, (11.34) can now be written as 

n 

- M MM 

^^£,(fe,;r)loge,(Z>) + 5];^Ajt/(7r)loga,/ . (11.35) 

LJt=l b k=Ol=\ 

We observe that the expected values Au and Ek(b) as defined in (3.20) and (3.21) 
on p. 64 for the Baum-Welch algorithm can be written as expectations of Akiin) 
and Eic{b,7t) with respect to P(7r\x,6^): 

Ek{b)^^P{7t\x,e')Ek{b,7t) and Au = ^P{n\x,e')Au{7i\ 
Doing the sum over tt first in (1 1.35) therefore gives 

M MM 
k=l b Jfc=0/=1 

Finally, we have to show that (3.18) maximises (1 1.36). Let us first look at the 



A term. The difference between this term for a?- = and for any other aij is 

MM Q M / \ M 0 

Jt=0 l=\ k=o\ U I 1=1 

The last expression is the relative entropy (11.10), and thus it is larger than zero 
|>. unless aki = aj^. This proves that the maximum is at aj^. Exactly the same 
procedvu*e can be used for the E term. 

For the HMM the E-step of the EM algorithm consists of calculating the ex- 
.pectatioiis E^ib) and Au, This is done by the forward-backward procedure as 
described in Chapter 3. This completely determines the Q function, and the 
I maximum is expressed directly in terms of these numbers. Therefore the M-step 
r just consists of plugging Ek(b) and Aki into the re-estimation formulae for Ckib) 
id^jH given in (3.18). ^ 
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